5.15.0-30-generic : SSBD mitigation results in "unchecked MSR access error: WRMSR to 0x48 (tried to write 0x0000000000000004)" and flood of kernel traces in some cloud providers

Bug #1973839 reported by Ian Wienand
This bug affects 4 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

When booting this kernel in one of our clouds, we see an error early in the kernel output:

  kernel: unchecked MSR access error: WRMSR to 0x48 (tried to write 0x0000000000000004) at rIP: 0xffffffffabc90af4 (native_write_msr+0x4/0x20)

and then an unending stream of "bare" tracebacks, which I think must be related:

[ 2.285717] kernel: Call Trace:
[ 2.285722] kernel: <TASK>
[ 2.285723] kernel: ? speculation_ctrl_update+0x95/0x200
[ 2.292001] kernel: speculation_ctrl_update_current+0x1f/0x30
[ 2.292011] kernel: ssb_prctl_set+0x92/0xe0
[ 2.292016] kernel: arch_seccomp_spec_mitigate+0x62/0x70
[ 2.292019] kernel: seccomp_set_mode_filter+0x4de/0x530
[ 2.292024] kernel: do_seccomp+0x37/0x1f0
[ 2.292026] kernel: __x64_sys_seccomp+0x18/0x20
[ 2.292028] kernel: do_syscall_64+0x5c/0xc0
[ 2.292035] kernel: ? handle_mm_fault+0xd8/0x2c0
[ 2.299617] kernel: ? do_user_addr_fault+0x1e3/0x670
[ 2.312878] kernel: ? exit_to_user_mode_prepare+0x37/0xb0
[ 2.312894] kernel: ? irqentry_exit_to_user_mode+0x9/0x20
[ 2.312905] kernel: ? irqentry_exit+0x19/0x30
[ 2.312907] kernel: ? exc_page_fault+0x89/0x160
[ 2.312909] kernel: ? asm_exc_page_fault+0x8/0x30
[ 2.312914] kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 2.312919] kernel: RIP: 0033:0x7fcffd6eaa3d
[ 2.312924] kernel: Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 a3 0f 00 f7 d8 64 89 01 48
[ 2.312926] kernel: RSP: 002b:00007ffe352e2938 EFLAGS: 00000246 ORIG_RAX: 000000000000013d
[ 2.312930] kernel: RAX: ffffffffffffffda RBX: 0000557d99d0c0c0 RCX: 00007fcffd6eaa3d
[ 2.319941] systemd[1]: Starting Load Kernel Module configfs...
[ 2.320103] kernel: RDX: 0000557d99c01290 RSI: 0000000000000000 RDI: 0000000000000001
[ 2.339938] kernel: RBP: 0000000000000000 R08: 0000000000000001 R09: 0000557d99c01290
[ 2.339941] kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[ 2.339942] kernel: R13: 0000000000000001 R14: 0000557d99c01290 R15: 0000000000000001
[ 2.339947] kernel: </TASK>

We never see any further warnings or context for the tracebacks; they just keep coming over and over, filling up the logs. This is on a jammy x86_64 system running 5.15.0-30-generic.

Unfortunately I don't know exactly what is behind it on the cloud side.

There seem to be several existing bugs that are similar but not exactly the same:

https://bugzilla.redhat.com/show_bug.cgi?id=1808996
https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1921880

The gist seems to be that the value 0x4 refers to the SSBD mitigation bit, and that the combination of certain qemu versions and a jammy guest kernel makes the guest unhappy. I will attach cpuid info for the guest.
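To make the decoding concrete: MSR 0x48 is IA32_SPEC_CTRL, and the 0x4 being written corresponds to its SSBD bit. The small C sketch below just mirrors the kernel's constants from arch/x86/include/asm/msr-index.h for reference; it is illustrative and not taken from the failing system:

  #include <stdio.h>

  /* Constants mirrored from arch/x86/include/asm/msr-index.h. */
  #define MSR_IA32_SPEC_CTRL  0x00000048UL
  #define SPEC_CTRL_IBRS      (1UL << 0)  /* Indirect Branch Restricted Speculation */
  #define SPEC_CTRL_STIBP     (1UL << 1)  /* Single Thread Indirect Branch Predictors */
  #define SPEC_CTRL_SSBD      (1UL << 2)  /* Speculative Store Bypass Disable */

  int main(void)
  {
      /* The faulting write in the trace is MSR 0x48 <- 0x4, i.e. the kernel
       * setting only the SSBD bit in IA32_SPEC_CTRL. */
      printf("WRMSR target 0x%lx, value 0x%lx (SPEC_CTRL_SSBD)\n",
             MSR_IA32_SPEC_CTRL, SPEC_CTRL_SSBD);
      return 0;
  }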

I installed the 5.17 kernel from the mainline repository, and the problem appears to go away. I will attempt to bisect it down to something more specific.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1973839

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Ian Wienand (iwienand) wrote : Re: 5.15.0-30-generic : unchecked MSR access error: WRMSR to 0x48 (tried to write 0x0000000000000004)

I've made this Confirmed because the log collection (apport-collect 1973839) is hundreds of megabytes, as dmesg is full of the tracebacks discussed above.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Ian Wienand (iwienand) wrote :

I have bisected this, and the commit that *fixes* it between the jammy kernel (5.15.0-30-generic) and the current 5.17 release is:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2f46993d83ff4abb310ef7b4beced56ba96f0d9d

 x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl
 Switch the kernel default of SSBD and STIBP to the ones with
 CONFIG_SECCOMP=n (i.e. spec_store_bypass_disable=prctl
 spectre_v2_user=prctl) even if CONFIG_SECCOMP=y.

Ian Wienand (iwienand) wrote :

So after reading and experimenting a bit more, what the upstream change is doing is setting the defaults to

spec_store_bypass_disable=prctl
spectre_v2_user=prctl

instead of "seccomp". This basically means that instead of all seccomp() users setting these flags, it is up to userspace to set manually via prctl(). The linked upstream change goes into all the reasons why this is the right thing to do.

From the cpuid output of the guest on the failing cloud provider we see

      SSBD: speculative store bypass disable = true

suggesting that this has been explicitly disabled? It's unclear to me whether that's set by the cloud provider in qemu, and I'm not sure I can tell from inside the guest without backend access.
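For what it's worth, the cpuid tool derives that line from CPUID.(EAX=7,ECX=0):EDX bit 31, which advertises that the SSBD control bit in IA32_SPEC_CTRL should be usable, so the guest kernel attempting the WRMSR is consistent with what the hypervisor claims. A quick way to check the bit from inside a guest (an illustrative sketch using GCC/Clang's <cpuid.h>):

  #include <stdio.h>
  #include <cpuid.h>

  int main(void)
  {
      unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

      /* CPUID leaf 7, subleaf 0: structured extended feature flags. */
      if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
          fprintf(stderr, "CPUID leaf 7 not supported\n");
          return 1;
      }

      /* EDX bit 31: the CPU (or hypervisor) advertises the SSBD control. */
      printf("SSBD advertised: %s\n", (edx & (1u << 31)) ? "yes" : "no");
      return 0;
  }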

OpenDev is a canary for this sort of thing, as we are extremely heterogeneous across clouds: we have resources donated by about 7-8 different cloud providers, each with multiple regions (across x86_64 and arm64), all used simultaneously for CI work (we use whatever people will donate). I've tested that booting with spec_store_bypass_disable=prctl stops the traces in the affected cloud, so we'll probably implement this.
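As a quick sanity check for the workaround, the kernel reports the active SSB mitigation mode in sysfs; reading /sys/devices/system/cpu/vulnerabilities/spec_store_bypass (sketched below in C, though cat does the same job) shows whether the booted kernel ended up in prctl or seccomp mode:

  #include <stdio.h>

  int main(void)
  {
      const char *path =
          "/sys/devices/system/cpu/vulnerabilities/spec_store_bypass";
      char line[256];
      FILE *f = fopen(path, "r");

      if (!f) {
          perror(path);
          return 1;
      }
      /* Prints e.g. "Mitigation: Speculative Store Bypass disabled via prctl". */
      if (fgets(line, sizeof(line), f))
          fputs(line, stdout);
      fclose(f);
      return 0;
  }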

However, I think there's probably enough here to consider backporting this commit for maximum compatibility of the generic images. The system seems to work well enough (which is how this passed all our initial CI), but the spewing traces will quickly fill disks with bloated log files (which is how we found it after running in production).

Ian Wienand (iwienand)
summary: - 5.15.0-30-generic : unchecked MSR access error: WRMSR to 0x48 (tried to
- write 0x0000000000000004)
+ 5.15.0-30-generic : SSBD mitigation results in "unchecked MSR access
+ error: WRMSR to 0x48 (tried to write 0x0000000000000004)" and flood of
+ kernel traces in some cloud providers
Kai-Heng Feng (kaihengfeng) wrote :

Which cloud is in use, and what is the instance type?

Ian Wienand (iwienand) wrote :

> Which cloud is in use, and what is the instance type?

This was seen on OVH. I don't think it's a public instance type. I have attached the cpuid output of the affected guest (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1973839/+attachment/5590542/+files/cpuid), but unfortunately I don't have details about what is happening behind KVM.

Jens Bretschneider (jens.bretschneider.plusnet) wrote :

Same here after updating from 20.04 to 22.04 today:

[ 0.000000] Linux version 5.15.0-46-generic (buildd@lcy02-amd64-115) (gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022 (Ubuntu 5.15.0-46.49-generic 5.15.39)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-46-generic root=UUID=8eb2528b-9477-4dcd-b796-e38b25a16b14 ro console=tty1 console=ttyS0
...
[ 5.239998] unchecked MSR access error: WRMSR to 0x48 (tried to write 0x0000000000000004) at rIP: 0xffffffffbbe960f4 (native_write_msr+0x4/0x30)
[ 5.242224] Call Trace:
[ 5.242228] <TASK>
[ 5.242229] ? write_spec_ctrl_current+0x45/0x50
[ 5.242240] speculation_ctrl_update+0x8f/0x200
[ 5.245439] speculation_ctrl_update_current+0x1f/0x30
[ 5.246406] ssb_prctl_set+0x9a/0xf0
[ 5.247183] arch_seccomp_spec_mitigate+0x66/0x70
[ 5.248195] seccomp_set_mode_filter+0x4e2/0x530
[ 5.249094] do_seccomp+0x37/0x200
[ 5.249829] __x64_sys_seccomp+0x18/0x20
[ 5.250614] do_syscall_64+0x5c/0xc0
[ 5.251371] ? exit_to_user_mode_prepare+0x37/0xb0
[ 5.252268] ? irqentry_exit_to_user_mode+0x9/0x20
[ 5.253181] ? irqentry_exit+0x1d/0x30
[ 5.253932] ? exc_page_fault+0x89/0x170
[ 5.254714] entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 5.255633] RIP: 0033:0x7fd9bf82ca3d
[ 5.256398] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 a3 0f 00 f7 d8 64 89 01 48
[ 5.259150] RSP: 002b:00007fff2d4c0f88 EFLAGS: 00000246 ORIG_RAX: 000000000000013d
[ 5.260339] RAX: ffffffffffffffda RBX: 00005604b1701620 RCX: 00007fd9bf82ca3d
[ 5.261476] RDX: 00005604b16fc390 RSI: 0000000000000000 RDI: 0000000000000001
[ 5.262605] RBP: 0000000000000000 R08: 0000000000000001 R09: 00005604b16fc390
[ 5.263780] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[ 5.264940] R13: 0000000000000001 R14: 00005604b16fc390 R15: 0000000000000001
[ 5.266076] </TASK>
         Starting Journal Service...
[ 5.267097] Call Trace:
[ 5.267721] <TASK>
[ 5.268288] ? write_spec_ctrl_current+0x45/0x50
[ 5.269123] __switch_to_xtra+0x110/0x4e0
[ 5.269919] __switch_to+0x260/0x450
[ 5.270658] __schedule+0x23d/0x590
[ 5.271400] ? __do_softirq+0x27f/0x2e7
[ 5.272180] schedule+0x4e/0xc0
[ 5.272891] smpboot_thread_fn+0xff/0x160
[ 5.273696] ? smpboot_register_percpu_thread+0x140/0x140
[ 5.274655] kthread+0x12a/0x150
[ 5.275366] ? set_kthread_struct+0x50/0x50
[ 5.276170] ret_from_fork+0x22/0x30
[ 5.276930] </TASK>
[ 5.278397] Call Trace:

Cloud provider is https://www.ip-exchange.de/, based on OpenStack.
