ubuntu_ltp_controllers:cpuset_sched_domains: tests 3,9,11,17,19,25 report incorrect sched domain for cpu#32

Bug #1951289 reported by Stefan Bader
This bug affects 1 person
Affects              Status        Importance  Assigned to   Milestone
kunpeng920           Fix Released  Low         dann frazier
Ubuntu-18.04         Fix Released  Undecided   dann frazier
Ubuntu-18.04-hwe     Fix Released  Undecided   dann frazier
Ubuntu-20.04         Fix Released  Undecided   dann frazier
Upstream-kernel      Fix Released  Undecided   Unassigned
ubuntu-kernel-tests  Invalid       Undecided   Unassigned
linux (Ubuntu)       Fix Released  Undecided   Unassigned
Bionic               Fix Released  Undecided   dann frazier
Focal                Fix Released  Undecided   dann frazier
Hirsute              Won't Fix     Undecided   Unassigned

Bug Description

[Impact]
The LTP cpuset_sched_domains test, authored by Miao Xie, fails on a Kunpeng920
server that has 4 NUMA nodes:
  https://launchpad.net/bugs/1951289

This does appear to be a real bug. /proc/schedstat displays 4 domain levels for
CPUs on 2 of the nodes, but only 3 levels for the other 2 (see below).
I assume this means the scheduler is making suboptimal decisions about
where to place/move processes.

[Test Case]
On a 128-core Kunpeng 920 system, observe that half the CPUs are missing the domain3 scheduling domain (the 4th level, counting from domain0):

ubuntu@d06-4:~$ grep domain2 /proc/schedstat | wc -l
128
ubuntu@d06-4:~$ grep domain3 /proc/schedstat | wc -l
64
ubuntu@d06-4:~$
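The grep counts above only give totals. As a minimal sketch (not part of LTP or the kernel), the C functions below count the domainN lines that follow each cpuN line in text laid out like /proc/schedstat, then report the CPU records that have fewer levels than the deepest one. The function names are hypothetical, and the sketch assumes the records appear in CPU order with all CPUs online, so record index i corresponds to CPU i.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Count the "domainN" lines that follow each "cpuN" line in text laid out
 * like /proc/schedstat. levels[i] receives the domain count of the i-th CPU
 * record; the return value is the number of CPU records stored.
 * (Hypothetical helper; assumes records are in CPU order, all CPUs online.)
 */
static int count_domain_levels(const char *text, int *levels, int max_cpus)
{
    int ncpus = 0;
    int cur = -1;               /* index of the CPU record being filled, -1 if none */

    assert(text != NULL && levels != NULL);
    for (const char *line = text; line != NULL && *line != '\0';) {
        const char *nl = strchr(line, '\n');

        if (strncmp(line, "cpu", 3) == 0) {
            cur = (ncpus < max_cpus) ? ncpus++ : -1;
            if (cur >= 0)
                levels[cur] = 0;        /* start a new CPU record */
        } else if (strncmp(line, "domain", 6) == 0 && cur >= 0) {
            levels[cur]++;              /* one more domain level for this CPU */
        }
        line = nl ? nl + 1 : NULL;      /* "version"/"timestamp" lines are skipped */
    }
    return ncpus;
}

/* Print every CPU record whose domain count is below the deepest one; return how many. */
static int report_short_cpus(const int *levels, int ncpus)
{
    int max = 0, shortfall = 0;

    for (int i = 0; i < ncpus; i++)
        if (levels[i] > max)
            max = levels[i];
    for (int i = 0; i < ncpus; i++) {
        if (levels[i] < max) {
            printf("cpu record %d: %d domain levels (deepest is %d)\n",
                   i, levels[i], max);
            shortfall++;
        }
    }
    return shortfall;
}
```

Fed the contents of /proc/schedstat from an affected kernel, this should flag the same 64 CPUs that the domain2/domain3 grep counts disagree on.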

[What Could Go Wrong]
This changes the code used to populate sched domains, so it could potentially misbehave on other systems, leading to poor scheduling characteristics (higher latencies, lower overall throughput, etc.).


Revision history for this message
dann frazier (dannf) wrote :

I can reproduce on scobee w/ latest LTP from git. This issue is also reproducible on the previous hirsute kernel (5.11.0-40.44), so it does not appear to be a regression. scobee has 4 x 32-core NUMA nodes, totaling 128 CPUs. Interestingly, I could not reproduce on a system w/ the same SoC but only 3 NUMA nodes (96 CPUs).

It isn't clear to me what exactly the test thinks is wrong, but I did find something interesting. I captured /proc/schedstat at the time of the failure and I noticed that half of the cpus (0-31, 64-95) have 4 sched domains (domain0-domain3), while the other half (32-63, 96-127) only include the first 3 sched domains. I'll attach the full file, but here's a snippet contrasting the entries for cpu31 and cpu32:

----------------
cpu31 0 0 0 0 0 0 21656590030 600326550 26576
domain0 00000000,00000000,00000000,ffffffff 0 0 0 [...]
domain1 00000000,00000000,ffffffff,ffffffff 0 0 0 [...]
domain2 00000000,ffffffff,ffffffff,ffffffff 0 0 0 [...]
domain3 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 [...]
cpu32 0 0 0 0 0 0 5351733990 838031470 39918
domain0 00000000,00000000,ffffffff,00000000 0 0 0 [...]
domain1 00000000,00000000,ffffffff,ffffffff 0 0 0 [...]
domain2 00000000,ffffffff,ffffffff,ffffffff 0 0 0 [...]
----------------
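Each domainN line carries the domain's span as a cpumask printed in comma-separated 32-bit hex words, most significant word first, so cpu32's domain0 mask 00000000,00000000,ffffffff,00000000 covers CPUs 32-63. A minimal sketch for decoding such strings follows; the helper names are hypothetical and are not kernel APIs.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Value of one hex digit; the caller has already checked isxdigit(). */
static int hexval(char c)
{
    return isdigit((unsigned char)c) ? c - '0'
                                     : tolower((unsigned char)c) - 'a' + 10;
}

/*
 * Number of CPUs set in a mask printed as comma-separated hex words,
 * most significant word first, e.g. "00000000,ffffffff".
 * Returns -1 if the string holds anything but hex digits and commas.
 */
static int cpumask_weight_str(const char *mask)
{
    int weight = 0;

    assert(mask != NULL);
    for (const char *p = mask; *p != '\0'; p++) {
        if (*p == ',')
            continue;
        if (!isxdigit((unsigned char)*p))
            return -1;
        for (int v = hexval(*p); v != 0; v &= v - 1)
            weight++;               /* clear the lowest set bit each pass */
    }
    return weight;
}

/*
 * 1 if `cpu` is set in the mask, 0 if not (or if the mask is too short),
 * -1 if a malformed character is met before reaching that CPU's digit.
 * CPU 0 is bit 0 of the last (rightmost) hex digit.
 */
static int cpumask_test_cpu_str(const char *mask, int cpu)
{
    int digit = 0;                  /* hex digits seen, counting from the right */

    assert(mask != NULL);
    if (cpu < 0)
        return 0;
    for (size_t i = strlen(mask); i-- > 0;) {
        char c = mask[i];

        if (c == ',')
            continue;
        if (!isxdigit((unsigned char)c))
            return -1;
        if (digit == cpu / 4)
            return (hexval(c) >> (cpu % 4)) & 1;
        digit++;
    }
    return 0;
}
```

With the masks above, cpumask_weight_str() gives 32 for cpu32's domain0 and 128 for cpu31's domain3, the all-CPUs span that cpu32 never gets.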

Note that domain3 is the domain that comprises all CPUs. If that is what the test is looking for, then it would make sense that the test would begin to fail at CPU32. I turned on sched-domain debugging (/sys/kernel/debug/sched_debug), and verified that it seems to match what I see in /proc/schedstat. Specifically, that CPU32 is not assigned a domain-3:

[18683.213478] CPU31 attaching sched-domain(s):
[18683.213480] domain-0: span=0-31 level=MC
[18683.213485] groups: 31:{ span=31 }, 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 12:{ span=12 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }, 16:{ span=16 }, 17:{ span=17 }, 18:{ span=18 }, 19:{ span=19 }, 20:{ span=20 }, 21:{ span=21 }, 22:{ span=22 }, 23:{ span=23 }, 24:{ span=24 }, 25:{ span=25 }, 26:{ span=26 }, 27:{ span=27 }, 28:{ span=28 }, 29:{ span=29 }, 30:{ span=30 }
[18683.213582] domain-1: span=0-63 level=NUMA
[18683.213586] groups: 0:{ span=0-31 cap=32768 }, 32:{ span=32-63 cap=32768 }
[18683.213599] domain-2: span=0-95 level=NUMA
[18683.213604] groups: 0:{ span=0-63 cap=65536 }, 64:{ span=64-95 cap=32768 }
[18683.213615] domain-3: span=0-127 level=NUMA
[18683.213622] groups: 0:{ span=0-95 mask=0-31 cap=98304 }, 96:{ span=64-127 mask=96-127 cap=65536 }
[18683.213655] CPU32 attaching sched-domain(s):
[18683.213658] domain-0: span=32-63 level=MC
[18683.213662] groups: 32:{ span=32 }, 33:{ span=33 }, 34:{ span=34 }, 35:{ span=35 }, 36:{ span=36 }, 37:{ span=37 }, 38:{ span=38 }, 39:{ span=39 }, 40:{ span=40 }, 41:{ span=41 }, 42:{ span=42 }, 43:{ span=43 }, 44:{ span=44 }, 45:{ span=45 }, 46:{ span=46 }, 47:{ span=47 }, 48:{ span=48 }, 49:{ span=49 }, 50:{ span=50 }, 51:{ span=51 }, 52:{ span=52 }, 53:{ span=53 }, 54:{ span=54 }, 55:{ span=55 }, 56:{ span=56 }, 57:{ span=57 }, 58:{ span=58 }, 59:{ span=59 },...


Revision history for this message
dann frazier (dannf) wrote :
Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

Same issue found with bionic/linux 4.15.0-163.171 running on the same node.

tags: added: 4.15 bionic
Revision history for this message
dann frazier (dannf) wrote :

I found that I could reproduce this with mainline 5.11, but not mainline 5.13. I bisected it to find the change that fixed it, and hit:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=620a6dc40754dc218f5b6389b5d335e9a107fd29

This cherry-picks back to hirsute, and I verified that it solves the problem. I haven't tried going back further, but I can. It seems like we'd want to submit this to stable, but I don't feel I'm able to sufficiently provide justification as the test itself isn't clear to me.
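A minimal sketch of a failure mode that reproduces exactly the spans seen above. It assumes (this is not verified against the commit) that the pre-fix sched_init_numa() collected the distinct NUMA distances from node 0's row of the distance table only, while the fixed code collects them from every row. The 4x4 distance table is hypothetical, chosen to be consistent with the sched_debug spans; it was not captured from scobee. In it, 24 appears only between nodes 1 and 3, so node 0's row never contains it.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 4

/* Hypothetical distance table consistent with the observed domain spans. */
static const int dist[NR_NODES][NR_NODES] = {
    { 10, 12, 20, 22 },
    { 12, 10, 22, 24 },
    { 20, 22, 10, 12 },
    { 22, 24, 12, 10 },
};

/* Insert d into the ascending array out[0..n) unless present; return the new length. */
static int add_unique_sorted(int *out, int n, int d)
{
    int i = 0;

    while (i < n && out[i] < d)
        i++;
    if (i < n && out[i] == d)
        return n;                   /* already present */
    for (int j = n; j > i; j--)
        out[j] = out[j - 1];        /* shift larger values right */
    out[i] = d;
    return n + 1;
}

/*
 * Collect the distinct distances in ascending order, scanning only row 0
 * (rows_to_scan == 1) or every row (rows_to_scan == NR_NODES).
 * out must hold NR_NODES * NR_NODES ints.
 */
static int unique_distances(int rows_to_scan, int *out)
{
    int n = 0;

    for (int a = 0; a < rows_to_scan; a++)
        for (int b = 0; b < NR_NODES; b++)
            n = add_unique_sorted(out, n, dist[a][b]);
    return n;
}

/* Does the largest distance level reach every node from node `from`? */
static bool top_level_spans_all(int from, const int *levels, int nlevels)
{
    for (int b = 0; b < NR_NODES; b++)
        if (dist[from][b] > levels[nlevels - 1])
            return false;
    return true;
}
```

Scanning only row 0 yields 10, 12, 20, 22, and nodes 1 and 3 then never get a level spanning all nodes. That mirrors the missing domain3 for CPUs 32-63 and 96-127, assuming node N holds CPUs 32N to 32N+31. Scanning every row adds 24 and closes the gap.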

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1951289

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Bionic):
status: New → Incomplete
Changed in linux (Ubuntu Focal):
status: New → Incomplete
Changed in linux (Ubuntu Hirsute):
status: New → Incomplete
Revision history for this message
dann frazier (dannf) wrote :

The fix above also cherry-picks back to bionic, but strangely it causes the bionic kernel to fail to boot. I don't see any kernel messages after the EFI stub. I tried adding "earlycon" to get more debug info, but that somehow avoids the problem and boots fine w/ the fix. With earlycon, I can verify that the LTP test now passes. There's just something else missing.

Changed in kunpeng920:
importance: Undecided → Low
assignee: nobody → dann frazier (dannf)
Revision history for this message
dann frazier (dannf) wrote :

I came back to this and found that I can now get a failure w/ error messages when applying the fix (see comment #4) to bionic - see crash log below. So, I figured I could just bisect between v4.15 and v5.11 upstream w/ the fix applied and figure out what other change(s) are required to avoid the crash. Unfortunately, I hit a kernel 5.0.0-rc5+ where the same build sometimes crashes (w/ the below backtrace) and sometimes boots fine. So it seems as though there may be an underlying race. If that race is truly fixed in newer kernels, bisection will probably not be the best tool to find the fix, since the failure isn't 100% reproducible.

== bionic kernel w/ patch applied ==
[ 12.160242] CPU: All CPU(s) started at EL2
[ 12.165438] alternatives: patching kernel code
[ 12.186187] Unable to handle kernel paging request at virtual address 8dcaae1e1004
[ 12.194589] Mem abort info:
[ 12.197676] ESR = 0x96000004
[ 12.201055] Exception class = DABT (current EL), IL = 32 bits
[ 12.207619] SET = 0, FnV = 0
[ 12.210996] EA = 0, S1PTW = 0
[ 12.214471] Data abort info:
[ 12.217654] ISV = 0, ISS = 0x00000004
[ 12.221902] CM = 0, WnR = 0
[ 12.225186] [00008dcaae1e1004] user address but active_mm is swapper
[ 12.232238] Internal error: Oops: 96000004 [#1] SMP
[ 12.237644] Modules linked in:
[ 12.241026] Process swapper/0 (pid: 1, stack limit = 0x (ptrval))
[ 12.248459] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.15.18+ #1
[ 12.255216] pstate: 80800009 (Nzcv daif -PAN +UAO)
[ 12.260531] pc : build_sched_domains+0xb04/0xfd0
[ 12.265651] lr : build_sched_domains+0xae0/0xfd0
[ 12.270768] sp : ffff00000843bd20
[ 12.274434] x29: ffff00000843bd20 x28: ffffc5dfb98c0f80
[ 12.280320] x27: 00000000ffffffff x26: ffff3815115f2000
[ 12.286211] x25: 0000000000000100 x24: 0000000000000000
[ 12.292102] x23: ffff381511d69894 x22: ffffc5dfb9891600
[ 12.297988] x21: ffff381511d68e38 x20: ffffe5ffbb5fd200
[ 12.303880] x19: 0000000000000000 x18: ffffc5dfbfaec188
[ 12.309767] x17: 000000004cae2fed x16: 00000000804179ac
[ 12.315658] x15: 00000000bcf71eef x14: 0000000085f50aeb
[ 12.321546] x13: 0000000021ce98a4 x12: 00000000ffffff80
[ 12.327433] x11: ffff7f97feee5500 x10: 00000000fb44ed3c
[ 12.333319] x9 : 0000000000003b1b x8 : 0000000000000000
[ 12.339205] x7 : ffffc5dfbe007c00 x6 : 0000000000000002
[ 12.345098] x5 : ffffffffffffffff x4 : 0000000000000000
[ 12.350986] x3 : 0000000000000000 x2 : 00008dcaae1e1000
[ 12.356871] x1 : 0000000000000004 x0 : 0000000000000004
[ 12.362761] Call trace:
[ 12.365463] build_sched_domains+0xb04/0xfd0
[ 12.370196] sched_init_domains+0x88/0xb0
[ 12.374640] sched_init_smp+0x3c/0x90
[ 12.378696] kernel_init_freeable+0xf4/0x240
[ 12.383432] kernel_init+0x1c/0x114
[ 12.387294] ret_from_fork+0x10/0x18
[ 12.391254] Code: b4000201 93407e78 aa0103e0 f8787aa2 (f8626800)
[ 12.398067] ---[ end trace a7ac5adb59ec4af4 ]---
[ 12.403191] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 12.403191]

== kernel that sometimes boots OK w/ fix applied, sometimes doesn't ==
[ 11.975494] alternatives: patching k...


Revision history for this message
dann frazier (dannf) wrote :

Here's a decoded backtrace of the 5.0-rc5+ crash (commit 41ceb5e8 w/ the fix from commit 620a6dc4075 applied), which looks quite plausible.

static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
                                 const struct cpumask *cpu_map)
{
[...]
1196 case sa_sd_storage:
1197 __sdt_free(cpu_map);
                /* Fall through */
[...]
}

static void __sdt_free(const struct cpumask *cpu_map)
{
[...]
1781 if (sdd->sd) {
1782 sd = *per_cpu_ptr(sdd->sd, j); <<< crash here
[...]
}

static int
build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
{
[...]
error:
1989 __free_domain_allocs(&d, alloc_state, cpu_map);
1990
1991 return ret;
}

[ 11.975494] alternatives: patching kernel code
[ 11.985402] Unable to handle kernel paging request at virtual address 00006744c1718004
[ 11.994200] Mem abort info:
[ 11.997287] ESR = 0x96000004
[ 12.000667] Exception class = DABT (current EL), IL = 32 bits
[ 12.007236] SET = 0, FnV = 0
[ 12.010617] EA = 0, S1PTW = 0
[ 12.014092] Data abort info:
[ 12.017278] ISV = 0, ISS = 0x00000004
[ 12.021528] CM = 0, WnR = 0
[ 12.024810] [00006744c1718004] user address but active_mm is swapper
[ 12.031859] Internal error: Oops: 96000004 [#1] SMP
[ 12.037266] Modules linked in:
[ 12.040648] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5+ #7
[ 12.047601] pstate: 80800009 (Nzcv daif -PAN +UAO)
[ 12.052917] pc : build_sched_domains (/home/ubuntu/linux/kernel/sched/topology.c:1782 /home/ubuntu/linux/kernel/sched/topology.c:1197 /home/ubuntu/linux/kernel/sched/topology.c:1989)
[ 12.058133] lr : build_sched_domains (/home/ubuntu/linux/kernel/sched/topology.c:1778 /home/ubuntu/linux/kernel/sched/topology.c:1197 /home/ubuntu/linux/kernel/sched/topology.c:1989)
[ 12.063342] sp : ffff00001043bcf0
[ 12.067011] x29: ffff00001043bcf0 x28: ffffb75d3ae21a00
[ 12.072900] x27: ffff50187e5dc730 x26: ffffb75d3a806e80
[ 12.078788] x25: ffff50187e5dd3a4 x24: ffffb75d3a8077a0
[ 12.084675] x23: 0000000000000000 x22: ffff50187e5dd3a4
[ 12.090561] x21: ffff50187e5dc730 x20: ffffd77cfb981400
[ 12.096452] x19: 0000000000000000 x18: 0000000000000014
[ 12.102342] x17: 00000000c60b0fdd x16: 00000000eb2df79d
[ 12.108231] x15: 000000001a6f88f6 x14: 00000000a5b719f8
[ 12.114122] x13: 00000000006ba184 x12: 000000004b281177
[ 12.120013] x11: ffff7f5df3eebf80 x10: 00000000cf4217a7
[ 12.125901] x9 : 0000000000003570 x8 : 0000000000210d00
[ 12.131791] x7 : ffffd77cfbaee580 x6 : 0000000000000002
[ 12.137680] x5 : ffffd77d7fe741c0 x4 : ffffffffffffffff
[ 12.143571] x3 : 0000000000000000 x2 : 00006744c1718000
[ 12.149460] x1 : 0000000000000004 x0 : 0000000000000004
[ 12.155352] Process swapper/0 (pid: 1, stack limit = 0x(____ptrval____))
[ 12.162785] Call trace:
[ 12.165490] build_sched_domains (/home/ubuntu/linux/kernel/sched/topology.c:1782 /home/ubuntu/linux/kernel/sched/topology.c:1197 /home/ubuntu/linux/kernel/sched/topology.c:1989)
[ 12.170314] sched_init_domains (/home/ubuntu/linux/kernel/sched/topology.c:2064)
[ 12.174760] sched_init_smp (/home/ubuntu/linux/kernel/sched/core.c:5876)
[ 12.178812] kernel_init_freeab...


Revision history for this message
dann frazier (dannf) wrote :

After some debugging, I realized the above is the same issue that this commit fixed upstream:

commit 71e5f6644fb2f3304fcb310145ded234a37e7cc1
Author: Dietmar Eggemann <dietmar.eggemann@XXX>
Date: Mon Feb 1 10:53:53 2021 +0100

    sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa()
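The commit title points at the allocation of the sched_domain_topology_level array. As a minimal sketch of why that class of bug can crash only some of the time, assume the array is walked until an entry with a NULL mask, as the kernel's for_each_sd_topology() loop does. If the allocation leaves out the terminator slot, the walk reads whatever memory follows the array, and whether that happens to be zero can vary from boot to boot. The struct and helpers below are hypothetical stand-ins, not kernel code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a topology level; the array ends at mask == NULL. */
struct topo_level {
    const char *mask;   /* NULL marks the end of the array */
    const char *name;
};

/*
 * Allocate room for nr_levels real entries plus the NULL terminator.
 * calloc() zeroes the block, so the final entry is already the sentinel.
 * Allocating only nr_levels entries would leave the walk below unbounded.
 */
static struct topo_level *alloc_levels(size_t nr_levels)
{
    return calloc(nr_levels + 1, sizeof(struct topo_level));
}

/* Walk the array the way a sentinel-terminated iterator would. */
static size_t count_levels(const struct topo_level *tl)
{
    size_t n = 0;

    for (; tl->mask != NULL; tl++)
        n++;
    return n;
}
```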

I've backported these fixes and submitted them to stable (5.10.y, 5.4.y & 4.19.y):

https://www.spinics.net/lists/stable/msg539011.html
https://www.spinics.net/lists/stable/msg539981.html

4.14.y's code is too different for these changes to easily apply.

I'll wait for them to bake there to shake out any regressions before submitting to Ubuntu. I expect that focal will pick up the fix from there naturally, but bionic will need an explicit submission since they won't make it into a 4.14.y release.

Revision history for this message
dann frazier (dannf) wrote :

Not a bug in the test.

Changed in linux (Ubuntu Hirsute):
status: Incomplete → Won't Fix
Changed in linux (Ubuntu):
status: Incomplete → Fix Released
Changed in linux (Ubuntu Bionic):
status: Incomplete → In Progress
Changed in linux (Ubuntu Focal):
status: Incomplete → In Progress
Changed in linux (Ubuntu Bionic):
assignee: nobody → dann frazier (dannf)
Changed in linux (Ubuntu Focal):
assignee: nobody → dann frazier (dannf)
Changed in kunpeng920:
status: New → In Progress
Changed in ubuntu-kernel-tests:
status: New → Invalid
Revision history for this message
dann frazier (dannf) wrote :

Released upstream in 5.4.183. The inclusion of the changes from that stable release is being tracked in bug 1969239.

Revision history for this message
dann frazier (dannf) wrote :

5.4.183 has now been merged into focal

Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
dann frazier (dannf)
description: updated
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/4.15.0-179.188 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
dann frazier (dannf)
description: updated
Revision history for this message
dann frazier (dannf) wrote :

= bionic verification =
ubuntu@d06-4:~$ cat /proc/version
Linux version 4.15.0-179-generic (buildd@bos02-arm64-025) (gcc version 7.5.0 (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)) #188-Ubuntu SMP Tue May 10 20:51:17 UTC 2022
ubuntu@d06-4:~$ grep domain2 /proc/schedstat | wc -l
128
ubuntu@d06-4:~$ grep domain3 /proc/schedstat | wc -l
128
ubuntu@d06-4:~$

tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-184.194

---------------
linux (4.15.0-184.194) bionic; urgency=medium

  * CVE-2022-1966
    - netfilter: nf_tables: disallow non-stateful expression in sets earlier

 -- Thadeu Lima de Souza Cascardo <email address hidden> Thu, 02 Jun 2022 15:36:51 -0300

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

bug 1969239 has been marked as fix-released with 5.4.0-117.132

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Updating kunpeng920 series to match corresponding linux package status.

Changed in kunpeng920:
status: In Progress → Fix Released