Stack trace booting 20.04 LTS server on system with dual Xeon Gold 6240 CPUs

Bug #1906716 reported by Jeff Lane 
This bug affects 1 person
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

I noticed this in syslog while investigating an unrelated issue today. I have Focal installed on a Fujitsu RX2530 M5 server with two Xeon Gold 6240 (18c/36t) CPUs. Every reboot results in the following unchecked MSR access error and stack trace:

Dec 3 17:34:31 nabbit kernel: [ 0.002463] smpboot: CPU 18 Converting physical 0 to logical die 1
Dec 3 17:34:31 nabbit kernel: [ 0.002463] unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0000000000000000) at rIP: 0xffffffff81c78b04 (native_write_msr+0x4/0x30)
Dec 3 17:34:31 nabbit kernel: [ 0.002463] Call Trace:
Dec 3 17:34:31 nabbit kernel: [ 0.002463] ? intel_pmu_cpu_starting+0x87/0x270
Dec 3 17:34:31 nabbit kernel: [ 0.002463] ? x86_pmu_dead_cpu+0x30/0x30
Dec 3 17:34:31 nabbit kernel: [ 0.002463] x86_pmu_starting_cpu+0x1a/0x30
Dec 3 17:34:31 nabbit kernel: [ 0.002463] cpuhp_invoke_callback+0x9b/0x580
Dec 3 17:34:31 nabbit kernel: [ 0.002463] notify_cpu_starting+0x66/0x80
Dec 3 17:34:31 nabbit kernel: [ 0.002463] start_secondary+0xaa/0x1c0
Dec 3 17:34:31 nabbit kernel: [ 0.002463] secondary_startup_64+0xa4/0xb0
Dec 3 17:34:31 nabbit kernel: [ 0.498575] #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35
Dec 3 17:34:31 nabbit kernel: [ 0.618576] .... node #0, CPUs: #36
Dec 3 17:34:31 nabbit kernel: [ 0.623308] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
Dec 3 17:34:31 nabbit kernel: [ 0.623308] TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.
Dec 3 17:34:31 nabbit kernel: [ 0.623308] #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47 #48 #49 #50 #51 #52 #53
Dec 3 17:34:31 nabbit kernel: [ 0.672450] .... node #1, CPUs: #54 #55 #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71
Dec 3 17:34:31 nabbit kernel: [ 0.729432] smp: Brought up 2 nodes, 72 CPUs
Dec 3 17:34:31 nabbit kernel: [ 0.729432] smpboot: Max logical packages: 2
Dec 3 17:34:31 nabbit kernel: [ 0.729432] smpboot: Total of 72 processors activated (374479.29 BogoMIPS)

It doesn't seem to be catastrophic, but it is troubling to find this in the logs.
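
For anyone wanting to check other machines for the same message, here is a minimal Python sketch that pulls the MSR address, the attempted value, and the faulting symbol out of log lines. The WRMSR pattern is taken from the trace above; the RDMSR variant and the function name `find_msr_errors` are assumptions added for illustration, not something seen on this machine.

```python
import re

# WRMSR form copied from the trace in this report; the RDMSR form
# ("RDMSR from 0x...") is an assumption and has no "tried to write" part.
MSR_ERROR = re.compile(
    r"unchecked MSR access error: (?P<op>WRMSR|RDMSR) (?:to|from) "
    r"(?P<msr>0x[0-9a-fA-F]+)"
    r"(?: \(tried to write (?P<value>0x[0-9a-fA-F]+)\))?"
    r" at rIP: (?P<rip>0x[0-9a-fA-F]+) \((?P<symbol>[^)]+)\)"
)


def find_msr_errors(lines):
    """Return one dict per unchecked-MSR-access message found in `lines`."""
    hits = []
    for line in lines:
        match = MSR_ERROR.search(line)
        if match:
            hits.append(match.groupdict())
    return hits


# Two lines copied from the trace above.
sample = [
    "Dec 3 17:34:31 nabbit kernel: [ 0.002463] smpboot: CPU 18 "
    "Converting physical 0 to logical die 1",
    "Dec 3 17:34:31 nabbit kernel: [ 0.002463] unchecked MSR access error: "
    "WRMSR to 0x10f (tried to write 0x0000000000000000) at rIP: "
    "0xffffffff81c78b04 (native_write_msr+0x4/0x30)",
]

hits = find_msr_errors(sample)
for hit in hits:
    print(hit["op"], hit["msr"], hit["value"], hit["symbol"])
```

The same function can be fed an open file object, for example the syslog file where the trace was first noticed (its path depends on the system's logging setup).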

On a different Fujitsu server (RX2540 M5) with 2x Xeon Gold 6242 CPUs (16c/32t), this trace is not present, so the problem could be specific to this particular machine or this particular CPU model.

Here is the SMP boot log from the non-failing machine:

Dec 2 16:02:56 polari kernel: [ 1.522346] smpboot: CPU0: Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (family: 0x6, model: 0x55, stepping: 0x5)
Dec 2 16:02:56 polari kernel: [ 1.522575] Performance Events: PEBS fmt3+, Skylake events, 32-deep LBR, full-width counters, Intel PMU driver.
Dec 2 16:02:56 polari kernel: [ 1.522584] ... version: 4
Dec 2 16:02:56 polari kernel: [ 1.522585] ... bit width: 48
Dec 2 16:02:56 polari kernel: [ 1.522587] ... generic registers: 4
Dec 2 16:02:56 polari kernel: [ 1.522588] ... value mask: 0000ffffffffffff
Dec 2 16:02:56 polari kernel: [ 1.522589] ... max period: 00007fffffffffff
Dec 2 16:02:56 polari kernel: [ 1.522591] ... fixed-purpose events: 3
Dec 2 16:02:56 polari kernel: [ 1.522592] ... event mask: 000000070000000f
Dec 2 16:02:56 polari kernel: [ 1.522665] rcu: Hierarchical SRCU implementation.
Dec 2 16:02:56 polari kernel: [ 1.524965] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
Dec 2 16:02:56 polari kernel: [ 1.525875] smp: Bringing up secondary CPUs ...
Dec 2 16:02:56 polari kernel: [ 1.525990] x86: Booting SMP configuration:
Dec 2 16:02:56 polari kernel: [ 1.525992] .... node #0, CPUs: #1 #2 #3
Dec 2 16:02:56 polari kernel: [ 1.533485] .... node #1, CPUs: #4 #5 #6 #7
Dec 2 16:02:56 polari kernel: [ 1.543960] .... node #0, CPUs: #8 #9 #10 #11
Dec 2 16:02:56 polari kernel: [ 1.553544] .... node #1, CPUs: #12 #13 #14 #15
Dec 2 16:02:56 polari kernel: [ 1.564701] .... node #2, CPUs: #16
Dec 2 16:02:56 polari kernel: [ 0.002176] smpboot: CPU 16 Converting physical 0 to logical die 1
Dec 2 16:02:56 polari kernel: [ 1.651254] #17 #18 #19
Dec 2 16:02:56 polari kernel: [ 1.659278] .... node #3, CPUs: #20 #21 #22 #23
Dec 2 16:02:56 polari kernel: [ 1.669669] .... node #2, CPUs: #24 #25 #26 #27
Dec 2 16:02:56 polari kernel: [ 1.680637] .... node #3, CPUs: #28 #29 #30 #31
Dec 2 16:02:56 polari kernel: [ 1.691394] .... node #0, CPUs: #32
Dec 2 16:02:56 polari kernel: [ 1.693845] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
Dec 2 16:02:56 polari kernel: [ 1.693845] TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.
Dec 2 16:02:56 polari kernel: [ 1.693845] #33 #34 #35
Dec 2 16:02:56 polari kernel: [ 1.701504] .... node #1, CPUs: #36 #37 #38 #39
Dec 2 16:02:56 polari kernel: [ 1.712687] .... node #0, CPUs: #40 #41 #42 #43
Dec 2 16:02:56 polari kernel: [ 1.723263] .... node #1, CPUs: #44 #45 #46 #47
Dec 2 16:02:56 polari kernel: [ 1.733658] .... node #2, CPUs: #48 #49 #50 #51
Dec 2 16:02:56 polari kernel: [ 1.744372] .... node #3, CPUs: #52 #53 #54 #55
Dec 2 16:02:56 polari kernel: [ 1.755243] .... node #2, CPUs: #56 #57 #58 #59
Dec 2 16:02:56 polari kernel: [ 1.765640] .... node #3, CPUs: #60 #61 #62 #63
Dec 2 16:02:56 polari kernel: [ 1.776965] smp: Brought up 4 nodes, 64 CPUs
Dec 2 16:02:56 polari kernel: [ 1.776965] smpboot: Max logical packages: 2
Dec 2 16:02:56 polari kernel: [ 1.776965] smpboot: Total of 64 processors activated (358464.56 BogoMIPS)
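
As a quick way to compare the two boots, the sketch below extracts the node and CPU counts from the "smp: Brought up" summary line of each log. The helper name `smp_summary` is hypothetical, and the two input lines are copied from the logs in this report; it only reports the counts and draws no conclusion about the cause.

```python
import re

# Matches the summary line printed once secondary CPUs are up.
SUMMARY = re.compile(r"smp: Brought up (?P<nodes>\d+) nodes?, (?P<cpus>\d+) CPUs?")


def smp_summary(lines):
    """Return (nodes, cpus) from the first summary line, or None if absent."""
    for line in lines:
        match = SUMMARY.search(line)
        if match:
            return int(match["nodes"]), int(match["cpus"])
    return None


# Summary lines from the failing (nabbit) and non-failing (polari) machines.
nabbit = ["Dec 3 17:34:31 nabbit kernel: [ 0.729432] smp: Brought up 2 nodes, 72 CPUs"]
polari = ["Dec 2 16:02:56 polari kernel: [ 1.776965] smp: Brought up 4 nodes, 64 CPUs"]

print("nabbit:", smp_summary(nabbit))
print("polari:", smp_summary(polari))
```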

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.4.0-56-generic 5.4.0-56.62
ProcVersionSignature: Ubuntu 5.4.0-56.62-generic 5.4.73
Uname: Linux 5.4.0-56-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Dec 3 20:02 seq
 crw-rw---- 1 root audio 116, 33 Dec 3 20:02 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu27.10
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CasperMD5CheckResult: skip
Date: Thu Dec 3 20:15:56 2020
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 002: ID 0424:2533 Microchip Technology, Inc. (formerly SMSC)
 Bus 001 Device 004: ID 046b:ff10 American Megatrends, Inc. Virtual Keyboard and Mouse
 Bus 001 Device 003: ID 046b:ff01 American Megatrends, Inc. Virtual Hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: FUJITSU PRIMERGY RX2530 M5
PciMultimedia:

ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcFB: 0 mgag200drmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.4.0-56-generic root=UUID=0e82de6f-eac2-426d-b89e-e52b1acaa792 ro console=tty0
RelatedPackageVersions:
 linux-restricted-modules-5.4.0-56-generic N/A
 linux-backports-modules-5.4.0-56-generic N/A
 linux-firmware 1.187.4
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 10/17/2019
dmi.bios.vendor: FUJITSU // American Megatrends Inc.
dmi.bios.version: V5.0.0.14 R1.15.0 for D3383-B1x
dmi.board.name: D3383-B1
dmi.board.vendor: FUJITSU
dmi.board.version: S26361-D3383-B13 WGS04 GS01
dmi.chassis.asset.tag: nabbit
dmi.chassis.type: 23
dmi.chassis.vendor: FUJITSU
dmi.chassis.version: RX2530M5R3
dmi.modalias: dmi:bvnFUJITSU//AmericanMegatrendsInc.:bvrV5.0.0.14R1.15.0forD3383-B1x:bd10/17/2019:svnFUJITSU:pnPRIMERGYRX2530M5:pvr:rvnFUJITSU:rnD3383-B1:rvrS26361-D3383-B13WGS04GS01:cvnFUJITSU:ct23:cvrRX2530M5R3:
dmi.product.family: SERVER
dmi.product.name: PRIMERGY RX2530 M5
dmi.product.sku: S26361-K1659-Vxxx
dmi.sys.vendor: FUJITSU

Jeff Lane  (bladernr) wrote :
tags: added: hwcert-server
description: updated
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed