kernel leaking TCP_MEM

Bug #2037335 reported by Terra Field
This bug affects 2 people
Affects: linux-meta-aws-6.2 (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

We are running our Kafka brokers on Jammy on ARM64. Previously they were on kernel version 5.15.0-1028-aws, but a few weeks ago we built a new AMI that picked up 6.2.0-1009-aws. We have since upgraded to 6.2.0-1012-aws and see the same problem.

What we expected to happen:
TCP memory (TCP_MEM) to fluctuate but stay relatively low (on a busy production broker running 5.15.0-1028-aws, we average 1900 pages over a 24-hour period).

What happened instead:
TCP memory (TCP_MEM) continues to rise until it hits the limit (currently configured as 1.5 million pages). At that point the broker can no longer reliably create new connections, and we start seeing "kernel: TCP: out of memory -- consider tuning tcp_mem" in dmesg output. If allowed to continue, the broker eventually isolates itself from the rest of the cluster because it can't talk to the other brokers.
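
For anyone wanting to check for the same growth, one way is to compare the "mem" field on the "TCP:" line of /proc/net/sockstat with the third value in /proc/sys/net/ipv4/tcp_mem; both are counted in pages. The following is a minimal Python sketch, assuming those standard procfs paths. The function names are illustrative only and are not part of the original report.

```python
from pathlib import Path


def parse_tcp_mem_pages(sockstat_text: str) -> int:
    """Return the TCP 'mem' value (in pages) from /proc/net/sockstat content."""
    for line in sockstat_text.splitlines():
        if line.startswith("TCP:"):
            # After the "TCP:" label the line is alternating key/value tokens.
            fields = line.split()[1:]
            stats = dict(zip(fields[0::2], fields[1::2]))
            return int(stats["mem"])
    raise ValueError("no TCP line found in sockstat text")


def parse_tcp_mem_limits(tcp_mem_text: str) -> tuple:
    """Return (low, pressure, high) page thresholds from the tcp_mem sysctl."""
    low, pressure, high = (int(v) for v in tcp_mem_text.split())
    return low, pressure, high


def usage_fraction(used_pages: int, limits: tuple) -> float:
    """Fraction of the hard limit (third tcp_mem value) currently in use."""
    return used_pages / limits[2]


def report() -> str:
    """Read the live procfs files and summarise TCP memory usage."""
    used = parse_tcp_mem_pages(Path("/proc/net/sockstat").read_text())
    limits = parse_tcp_mem_limits(Path("/proc/sys/net/ipv4/tcp_mem").read_text())
    return f"TCP mem: {used} of {limits[2]} pages ({usage_fraction(used, limits):.1%})"
```

Calling report() periodically (for example from a cron job or a metrics agent) and alerting when the fraction climbs steadily would surface the pattern described above well before the limit is reached.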

Attached is a graph of the average TCP memory usage per kernel version for our production environment over the past 24 hours.

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: linux-aws 6.2.0.1012.12~22.04.1
ProcVersionSignature: Ubuntu 6.2.0-1012.12~22.04.1-aws 6.2.16
Uname: Linux 6.2.0-1012-aws aarch64
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: arm64
CasperMD5CheckResult: unknown
CloudArchitecture: aarch64
CloudID: aws
CloudName: aws
CloudPlatform: ec2
CloudRegion: us-east-1
CloudSubPlatform: metadata (http://169.254.169.254)
Date: Mon Sep 25 20:56:02 2023
Ec2AMI: ami-0b9c5aafc5b2a4725
Ec2AMIManifest: (unknown)
Ec2Architecture: arm64
Ec2AvailabilityZone: us-east-1b
Ec2Imageid: ami-0b9c5aafc5b2a4725
Ec2InstanceType: im4gn.4xlarge
Ec2Instancetype: im4gn.4xlarge
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
Ec2Region: us-east-1
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-meta-aws-6.2
UpgradeStatus: No upgrade log present (probably fresh install)

Terra Field (terrable42) wrote :

Rebuilt the AMI with 5.15.0-1045-aws and the problem is gone.

Jonathan Heathcote (mossblaser) wrote :

Hello there,

I think this may be the same issue as https://bugs.launchpad.net/ubuntu/+source/linux-signed-aws-6.2/+bug/2045560

I believe this might be related to the following kernel bug which impacts Linux 6.0.0+:

https://<email address hidden>/

A patch has been produced which fixes this issue (but has not yet made it into a Linux release):

https://<email address hidden>/

Hope this helps!

Jonathan

Terra Field (terrable42) wrote :

Thank you, that is great to hear!

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-meta-aws-6.2 (Ubuntu):
status: New → Confirmed