isolcpus are ignored when using cgroups V2, causing processes to have wrong affinity

Bug #2076957 reported by Matthew Ruffell
This bug affects 2 people
Affects          Status         Importance  Assigned to      Milestone
linux (Ubuntu)   Fix Released   Undecided   Unassigned
Jammy            Fix Committed  Medium      Matthew Ruffell

Bug Description

BugLink: https://bugs.launchpad.net/bugs/2076957

[Impact]

In latency-sensitive environments, it is very common to use isolcpus to reserve a set of CPUs that no other processes are to be placed on, and to run only DPDK in poll mode on them.

There is a bug in the Jammy kernel: if cgroups V2 is enabled, then after several minutes the kernel places other processes onto these reserved isolated CPUs at random. This disturbs DPDK and introduces latency.

The issue does not occur with cgroups V1, so a workaround is to use cgroups V1 instead of V2 for the moment.
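
To check which cgroup hierarchy a machine is actually running before and after applying the workaround, a minimal sketch follows. It assumes the usual convention that /sys/fs/cgroup is mounted as cgroup2fs under the unified (V2) hierarchy and as tmpfs under the V1 (legacy or hybrid) layouts. cgroup_mode is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: map the filesystem type of /sys/fs/cgroup to a
# cgroup mode. Assumes cgroup2fs means V2 and tmpfs means V1/hybrid.
cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1" ;;
        *)         echo "unknown" ;;
    esac
}

# Usage on a live system:
#   cgroup_mode "$(stat -fc %T /sys/fs/cgroup)"
```

On a systemd-based install, switching to V1 is assumed here to mean booting with systemd.unified_cgroup_hierarchy=0 on the kernel command line, the inverse of the setting used in the testcase.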

[Fix]

After a full git bisect, I arrived at the following commit, which fixes the issue. It landed in 6.2-rc1:

commit 7fd4da9c1584be97ffbc40e600a19cb469fd4e78
Author: Waiman Long <email address hidden>
Date: Sat Nov 12 17:19:39 2022 -0500
Subject: cgroup/cpuset: Optimize cpuset_attach() on v2
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7fd4da9c1584be97ffbc40e600a19cb469fd4e78

Only the 5.15 Jammy kernel needs this fix. Focal works correctly as is.

The commit skips calls to cpuset_attach() if the underlying cpusets or memory have not changed in a cgroup, and it seems to fix the issue.

[Testcase]

Deploy a bare metal server, ideally with a large number of cores; 56 should be plenty.
Use Jammy, with the 5.15 GA kernel.

1) Edit /etc/default/grub and set GRUB_CMDLINE_LINUX_DEFAULT to have
"isolcpus=4-7,32-35 rcu_nocb_poll rcu_nocbs=4-7,32-35 systemd.unified_cgroup_hierarchy=1"
2) sudo update-grub && sudo reboot
3) sudo cat /sys/devices/system/cpu/isolated
4-7,32-35
4) sudo apt install s-tui stress
5) sudo s-tui
6) htop
7) $ while true; do sudo ps -eLF | head -n 1; for cpu in 4 5 6 7 32 33 34 35; do sudo ps -eLF | grep stress | awk -v a="$cpu" '$9 == a {print;}'; done; sleep 5; done

Set up isolcpus to separate off CPUs 4-7 and 32-35, so that each NUMA node has a set of isolated CPUs.

s-tui is a great frontend for stress, and it starts the stress processes. All stress processes should initially be on non-isolated CPUs. Confirm this with htop: CPUs 4-7 and 32-35 should be at 0% while every other CPU is at 100%.
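
As a cross-check alongside htop, the allowed CPU list of each stress process can be read from /proc. This is only a sketch: allowed_list_from_status is a hypothetical helper name, and it assumes the Cpus_allowed_list field of /proc/<pid>/status is the value of interest.

```shell
# Hypothetical helper: read the contents of /proc/<pid>/status on stdin
# and print the value of its Cpus_allowed_list field.
allowed_list_from_status() {
    awk -F':[ \t]*' '$1 == "Cpus_allowed_list" { print $2 }'
}

# Usage on a live system (pgrep prints one PID per line):
#   for pid in $(pgrep stress); do
#       printf '%s: ' "$pid"
#       allowed_list_from_status < "/proc/$pid/status"
#   done
```

If the printed list for a process includes any of 4-7 or 32-35, that process is allowed to run on an isolated CPU.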

After about 3 minutes, though sometimes it takes up to 10 minutes, a stress process or the s-tui process will be incorrectly placed onto an isolated CPU, causing that CPU's usage to increase in htop. The while loop from step 7, which checks the processor (PSR) column of the ps output, will also likely print the incorrectly placed process.
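
The check in step 7 hardcodes the isolated CPU numbers. A sketch of a variant that derives them from a kernel CPU list such as the one in /sys/devices/system/cpu/isolated is below. expand_cpulist and threads_on_cpus are hypothetical helper names, and column 9 is assumed to be PSR, as in the `ps -eLF` output used in step 7.

```shell
# Hypothetical helper: expand a kernel CPU list such as "4-7,32-35"
# into space-separated CPU numbers. Runs in a subshell so that the
# IFS change and the variables stay local.
expand_cpulist() (
    out=""
    IFS=','
    for part in $1; do
        case "$part" in
            *-*) start=${part%-*}; end=${part#*-} ;;
            *)   start=$part; end=$part ;;
        esac
        cpu=$start
        while [ "$cpu" -le "$end" ]; do
            out="${out:+$out }$cpu"
            cpu=$((cpu + 1))
        done
    done
    printf '%s\n' "$out"
)

# Hypothetical helper: read `ps -eLF` output on stdin, print the header
# line plus every line that contains the name in $2 and whose PSR
# (column 9) is in the CPU list in $1. Like the grep in step 7, the
# name match is a plain substring match on the whole line.
threads_on_cpus() {
    awk -v cpus="$(expand_cpulist "$1")" -v name="$2" '
        BEGIN { n = split(cpus, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
        NR == 1 { print; next }
        index($0, name) && ($9 in want) { print }
    '
}

# Usage on a live system:
#   sudo ps -eLF | threads_on_cpus "$(cat /sys/devices/system/cpu/isolated)" stress
```

Reading the ps output from stdin keeps the filter separate from the privileged ps call, so it can be tried on saved output as well.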

A test kernel is available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/sf391137-test

If you install it, the processes will not be placed onto the isolated cpus.

[Where problems could occur]

The patch changes how cgroups determines when cpuset_attach() should be called. cpuset_attach() is currently called very frequently in the 5.15 Jammy kernel, but most of those calls should be no-ops, because nothing has changed in the cpusets or memory of the cgroup the process is attached to. We are changing it to skip calling cpuset_attach() when there are no changes, which should offer a small performance increase as well as fix this isolcpus bug.

If a regression were to occur, it would affect cgroups V2 only, and it could cause resource limits to be applied incorrectly in the worst case.

Tags: jammy sts
Changed in linux (Ubuntu):
status: New → Fix Released
Changed in linux (Ubuntu Jammy):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Matthew Ruffell (mruffell)
description: updated
tags: added: jammy sts
Stefan Bader (smb)
Changed in linux (Ubuntu Jammy):
status: In Progress → Fix Committed