[SRU]x86, sched: Treat Intel SNC topology as default, COD as exception

Bug #1976511 reported by Zhanglei Mao
This bug affects 1 person
Affects          Status     Importance  Assigned to  Milestone
linux (Ubuntu)   Confirmed  Undecided   Unassigned

Bug Description

Apr 6 02:33:31 G292-280 kernel: [ 0.007531] ------------[ cut here ]------------
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] sched: CPU #20's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] WARNING: CPU: 20 PID: 0 at /build/linux-hwe-5.4-9Mb2g5/linux-hwe-5.4-5.4.0/arch/x86/kernel/smpboot.c:426 topology_sane.isra.9+0x6c/0x70
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] Modules linked in:
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] CPU: 20 PID: 0 Comm: swapper/20 Not tainted 5.4.0-107-generic #121~18.04.1-Ubuntu
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] Hardware name: GIGABYTE G292-280-00/MG52-G20-00, BIOS F02 10/28/2021
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] RIP: 0010:topology_sane.isra.9+0x6c/0x70
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] Code: 41 5c 5d c3 80 3d 1e 75 ba 01 00 75 ec 89 f1 41 89 d9 89 fe 45 89 e0 48 c7 c7 b8 00 b4 a6 c6 05 04 75 ba 01 01 e8 b4 c7 03 00 <0f> 0b eb cb 0f 1f 44 00 00 55 0f b6 05 a3 8d f0 01 c6 05 9c 8d f0
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] RSP: 0000:ffffb76b58e17eb8 EFLAGS: 00010086
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffa7264e88
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] RDX: ffffffffa7264e88 RSI: 0000000000000096 RDI: 0000000000000046
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] RBP: ffffb76b58e17ec8 R08: 0000000000000000 R09: 000000000002f680
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] R10: ffff8b69bf800000 R11: 00000000000002c0 R12: 0000000000000001
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] R13: 0000000000000014 R14: 0000000000000014 R15: 0000000000000002
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] FS: 0000000000000000(0000) GS:ffff8b69bf800000(0000) knlGS:0000000000000000
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] CR2: 0000000000000000 CR3: 0000007ee1a0a001 CR4: 0000000000760ee0
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] PKRU: 00000000
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] Call Trace:
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] set_cpu_sibling_map+0x14f/0x600
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] start_secondary+0x6e/0x1c0
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] secondary_startup_64+0xa4/0xb0
Apr 6 02:33:31 G292-280 kernel: [ 0.007531] ---[ end trace 789d1f3abd3d96d4 ]---

Revision history for this message
Zhanglei Mao (zhanglei-mao) wrote :

We did not see this on the Ubuntu 20.04.4 LTS HWE kernel (5.13.0-41-generic).

It does occur on both the Ubuntu 20.04.4 GA kernel and the 18.04.6 HWE kernel (5.4.0-107-generic).

Zhanglei Mao (zhanglei-mao) wrote :

You can find the above message in the kernel.log file.
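As a rough aid for locating that message, here is a minimal Python sketch that picks the scheduler topology warning out of log lines and extracts the CPU and node numbers. The function name `find_topology_warnings` and the sample list are illustrative assumptions; to scan a real file, pass an open file object for whatever log path your system uses.

```python
import re

# Pattern for the scheduler topology warning quoted in this report.
WARNING_RE = re.compile(
    r"sched: CPU #(?P<cpu>\d+)'s (?P<kind>\S+)-sibling CPU #(?P<sibling>\d+) "
    r"is not on the same node! \[node: (?P<cpu_node>\d+) != (?P<sibling_node>\d+)\]"
)


def find_topology_warnings(lines):
    """Yield one dict per matching log line, with numeric fields as int."""
    for line in lines:
        m = WARNING_RE.search(line)
        if m:
            yield {
                "cpu": int(m["cpu"]),
                "kind": m["kind"],
                "sibling": int(m["sibling"]),
                "cpu_node": int(m["cpu_node"]),
                "sibling_node": int(m["sibling_node"]),
            }


# Two lines copied from the log in the bug description.
sample = [
    "Apr 6 02:33:31 G292-280 kernel: [ 0.007531] Modules linked in:",
    "Apr 6 02:33:31 G292-280 kernel: [ 0.007531] sched: CPU #20's llc-sibling "
    "CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.",
]
hits = list(find_topology_warnings(sample))
print(hits)
```

Any iterable of strings works as input, so the same function can be fed a file object or the output of a log reader.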

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1976511

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Zhanglei Mao (zhanglei-mao) wrote : Re: kernel taint ( warning ) caused by smpboot.c: on 5.4.0

It seems very similar to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1882478/, which has been fixed on Bionic.

Jeff Lane  (bladernr) wrote :

Checked on a machine in my lab with that CPU (Intel Xeon Platinum 8380) and this does not happen there...

So I'd initially blame something related to firmware, perhaps, but since it also resolves with the 5.13 HWE, that is a little confusing.

Can you try a different CPU in this machine with 5.4 (like a Xeon Gold, or Silver, or a smaller Platinum)?

And does this happen on other models with the same CPU?

Andrea Righi (arighi) wrote :

@zhanglei-mao have you tried updating the kernel? Does it still happen with a more recent 5.4?

Zachary Tahenakos (ztahenakos) wrote :

The Intel Xeon Platinum 8380 is an IceLake chip (see https://www.intel.com/content/www/us/en/products/sku/212287/intel-xeon-platinum-8380-processor-60m-cache-2-30-ghz/specifications.html). The change referenced by Zhanglei only addresses chips that are Skylake-X.

There is an upstream kernel change that alters how SNC (Sub-NUMA Clustering) detection works: SNC is now treated as the default, with COD (Cluster-on-Die; SNC appears to be a newer implementation of this technology) handled as the exception:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.19-rc1&id=2c88d45edbb89029c1190bb3b136d2602f057c98

The hwe-5.13 kernel has this change, but the 5.4 kernel does not.
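To make the difference concrete, here is a simplified Python model of the two behaviours described above. It is a sketch under stated assumptions, not the kernel's code: the `Cpu` fields, the function names, the model labels, and in particular the contents of `NEW_COD_MODELS` are illustrative. It also assumes that CPU #20 and CPU #0 in the report share a package and an LLC id.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    model: str    # illustrative model label, e.g. "ICELAKE_X"
    package: int  # physical package (socket) id
    node: int     # NUMA node id
    llc_id: int   # last-level-cache id


# Per the comment above, the earlier fix only covers Skylake-X.
OLD_SNC_MODELS = frozenset({"SKYLAKE_X"})
# Assumption for illustration: which models are treated as COD.
NEW_COD_MODELS = frozenset({"HASWELL_X", "BROADWELL_X"})


def old_reaches_warning(c: Cpu, o: Cpu) -> bool:
    """Allowlist model: only listed SNC models skip the sanity check."""
    if c.llc_id != o.llc_id or c.node == o.node:
        return False
    return c.model not in OLD_SNC_MODELS


def new_reaches_warning(c: Cpu, o: Cpu) -> bool:
    """SNC-by-default model: same-package CPUs skip the check unless COD."""
    if c.llc_id != o.llc_id or c.node == o.node:
        return False
    if c.package == o.package and c.model not in NEW_COD_MODELS:
        return False
    return True


# Roughly the situation in the report: same package, different NUMA nodes.
cpu0 = Cpu("ICELAKE_X", package=0, node=0, llc_id=0)
cpu20 = Cpu("ICELAKE_X", package=0, node=1, llc_id=0)
print(old_reaches_warning(cpu20, cpu0))  # True
print(new_reaches_warning(cpu20, cpu0))  # False
```

The point of the model is only that an allowlist of known SNC parts misses every newer part, while a list of known COD parts does not need updating as new SNC-capable CPUs appear.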

Zhanglei Mao (zhanglei-mao) wrote :

Test results in response to Jeffrey's comment #5:

a. Yes. We've checked other Whitley-platform projects (different from the G292 series) on 20.04 and 18.04.
The miscellanea/kernel_taint_test fails on the 5.4 kernel and passes on the 20.04 HWE 5.13 kernel.

b. The G292-280 also passes this test on 20.04 hwe-edge with the 5.15 kernel.
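For context on why a boot-time WARNING shows up as a taint failure: the kernel sets the W taint flag (bit 9, value 512) in /proc/sys/kernel/tainted once any warning has fired. The Python sketch below decodes that bit. The helper names are illustrative, and how kernel_taint_test itself evaluates the value is an assumption not confirmed in this thread.

```python
TAINT_WARN_BIT = 9  # 'W' flag: the kernel has issued a warning (value 512)


def has_warn_taint(tainted_value: int) -> bool:
    """Return True if the W (warning) taint bit is set in the bitmask."""
    return bool(tainted_value & (1 << TAINT_WARN_BIT))


def read_tainted(path: str = "/proc/sys/kernel/tainted") -> int:
    """Read the current taint bitmask from procfs (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


print(has_warn_taint(512))  # True
print(has_warn_taint(0))    # False
```

On an affected machine, `has_warn_taint(read_tainted())` would be expected to return True after boot if the topology warning fired.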

Jeff Lane  (bladernr)
summary: - kernel taint ( warning ) caused by smpboot.c: on 5.4.0
+ [SRU]x86, sched: Treat Intel SNC topology as default, COD as exception
Jeff Lane  (bladernr) wrote :

@Zachary and @Andrea

So two questions, I guess...

1: What can we do? I'm unable to recreate this on my own non-Gigabyte system with that same CPU (Xeon Platinum 8380).

2: Is this a safe warning to ignore? Can we foresee possible issues with stability under load, or with performance?

And I guess a third question: is this worth escalating upstream?

Jeff Lane  (bladernr) wrote :

Can anyone answer my questions in comment #9?

Zhanglei Mao (zhanglei-mao) wrote :

Another partner reported the trace below, which may be similar. Their kernel version is 5.4.0-26-generic, which is not the latest, so they were asked to upgrade and verify again.

Aug 31 18:47:45 ubuntu kernel: [ 2.999509] Call Trace:
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] set_cpu_sibling_map+0x159/0x590
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] start_secondary+0x6f/0x1c0
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] secondary_startup_64+0xa4/0xb0
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] ---[ end trace 4403443dce444e18 ]---

Zhanglei Mao (zhanglei-mao) wrote :

Below is the full text from syslog:
Aug 31 18:47:45 ubuntu kernel: [ 8.758773] x86: Booting SMP configuration:
Aug 31 18:47:45 ubuntu kernel: [ 8.762673] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15
Aug 31 18:47:45 ubuntu kernel: [ 8.850673] .... node #1, CPUs: #16
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] ------------[ cut here ]------------
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] sched: CPU #16's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] WARNING: CPU: 16 PID: 0 at arch/x86/kernel/smpboot.c:415 topology_sane.isra.0+0x70/0x80
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] Modules linked in:
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] CPU: 16 PID: 0 Comm: swapper/16 Not tainted 5.4.0-26-generic #30-Ubuntu
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] Hardware name: Quanta Cloud Technology Inc. QuantaEdge EGX66Y-2U/S6YQ-MB (LBG-T, RTT), BIOS 2A02 05/19/2022
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] RIP: 0010:topology_sane.isra.0+0x70/0x80
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] Code: 5d c3 80 3d dc a1 7a 01 00 75 ec 41 89 d9 45 89 e0 44 89 d9 44 89 d6 48 c7 c7 b0 06 75 94 c6 05 c0 a1 7a 01 01 e8 1b dc 03 00 <0f> 0b eb c9 66 66 2e 0f 1f 84 00 00 00 00 00 90 55 be 00 04 00 00
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] RSP: 0000:ffffa96218c07ea8 EFLAGS: 00010086
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000208
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] RDX: 0000000000000001 RSI: 0000000000000086 RDI: 0000000000000046
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] RBP: ffffa96218c07eb8 R08: 0000000000000208 R09: 0000000000000010
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] R10: ffffffff94f92228 R11: ffffa96218c07d10 R12: 0000000000000001
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] R13: 0000000000000000 R14: ffff9d2c3fc10260 R15: 0000000000000000
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] FS: 0000000000000000(0000) GS:ffff9d6c3f800000(0000) knlGS:0000000000000000
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] CR2: 0000000000000000 CR3: 00000052c160a001 CR4: 0000000000760ee0
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] PKRU: 00000000
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] Call Trace:
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] set_cpu_sibling_map+0x159/0x590
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] start_secondary+0x6f/0x1c0
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] secondary_startup_64+0xa4/0xb0
Aug 31 18:47:45 ubuntu kernel: [ 2.999509] ---[ end trace 4403443dce444e18 ]---
Aug 31 18:47:45 ubuntu kernel: [ 9.047785] #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31
Aug 31 18:47:45 ubuntu kernel: [ 9.138674] .... node #2, CPUs: #32
Aug 31 18:47:45 ubuntu kernel: [ ...
