kernel Firmware Bug: TSC ADJUST differs failures during suspend

Bug #2025616 reported by Weichen Wu
This bug affects 1 person
Affects: linux-nvidia (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

[Summary]
Discovered kernel error messages during the suspend stress test
test case id: power-management/suspend_30_cycles_with_reboots

collected log
~~~
High failures:
  s3: 180 failures
========================================
    HIGH Kernel message: [38779.612837] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5266754042. Restoring (x 3)
    HIGH Kernel message: [38839.622411] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5295340520. Restoring (x 3)
    HIGH Kernel message: [38868.467564] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5270514862. Restoring (x 3)
    HIGH Kernel message: [38897.419897] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5275834616. Restoring (x 3)
    HIGH Kernel message: [38926.135900] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5249511520. Restoring (x 3)
    HIGH Kernel message: [38955.114760] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5252454094. Restoring (x 3)
    HIGH Kernel message: [38983.860142] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5270474360. Restoring (x 3)
    HIGH Kernel message: [39012.819868] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5264045578. Restoring (x 3)
~~~

[Failure rate]
1/1

[Additional information]
CID: 201711-25989
SKU: DGX-1 Station
system-manufacturer: NVIDIA
system-product-name: DGX Station
bios-version: 0406
CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (40x)
GPU: 07:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1db2] (rev a1)
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1db2] (rev a1)
0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1db2] (rev a1)
0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1db2] (rev a1)
nvidia-driver-version: 525.105.17
kernel-version: 5.15.0-1028-nvidia

[Stage]
Issue reported and logs collected at a later stage

Weichen Wu (weichenwu) wrote :

Automatically attached

Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Libera.chat.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/2025616/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
dann frazier (dannf)
affects: ubuntu → linux-nvidia (Ubuntu)
dann frazier (dannf) wrote :

I was curious why this wasn't a problem with focal/5.4. I took a look at the last cert run that used focal/5.4[*], and I see these errors in the logs as well. I then went back to the cert run that was used to award focal certification[**] and those errors do *not* appear there.

So either this is an intermittent failure, or likely one of 3 things happened in the interim:
 - The test changed (or was introduced)
 - The kernel changed
 - The firmware changed

The test does not appear to be new - I have not checked if it has changed.

The kernel version between these runs changed from 5.4.0-37.41-generic to 5.4.0-121.137-generic. A change to this kernel code was introduced in between, in 5.4.0-100.113-generic:

commit 7dcfa07b500834c75a4f5043a43f409a3f02bd5e
Author: Feng Tang <email address hidden>
Date: Wed Nov 17 10:37:50 2021 +0800

    x86/tsc: Add a timer to make sure TSC_adjust is always checked

    BugLink: https://bugs.launchpad.net/bugs/1956381

    commit c7719e79347803b8e3b6b50da8c6db410a3012b5 upstream.

That adds a timer so the code that *might* print this warning also runs every 10 minutes, in addition to whenever a CPU enters idle. But the log shows these messages appearing mostly every 28-29 seconds, so this being the cause seems unlikely.
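The spacing can be checked directly from the timestamps in the bug description. The following is a minimal sketch, not part of the original analysis: the helper name `warning_intervals` is hypothetical, and the log lines are copied from the description with the test harness prefix and the "(x 3)" suffix dropped.

```python
import re

# Kernel lines copied from the "collected log" section of this report.
LOG = """\
[38779.612837] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5266754042. Restoring
[38839.622411] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5295340520. Restoring
[38868.467564] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5270514862. Restoring
[38897.419897] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5275834616. Restoring
[38926.135900] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5249511520. Restoring
[38955.114760] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5252454094. Restoring
[38983.860142] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5270474360. Restoring
[39012.819868] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -5264045578. Restoring
"""

# Capture the dmesg timestamp (seconds since boot) of each warning.
PATTERN = re.compile(r"\[\s*(\d+\.\d+)\] \[Firmware Bug\]: TSC ADJUST differs")


def warning_intervals(log_text):
    """Return the gaps, in seconds, between consecutive TSC ADJUST warnings."""
    stamps = [float(m.group(1)) for m in PATTERN.finditer(log_text)]
    return [round(later - earlier, 2) for earlier, later in zip(stamps, stamps[1:])]


intervals = warning_intervals(LOG)
print(intervals)

# The periodic timer fires every 10 minutes (600 s); every gap here is far shorter.
print(all(gap < 600 for gap in intervals))
```

On this excerpt the first gap is about 60 seconds and the remaining six fall between 28 and 29 seconds, all well below the 600-second timer period.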

As for the firmware, the dmidecode output is identical between these runs, which suggests the firmware has not changed.

[*] https://certification.canonical.com/hardware/201711-25989/submission/269263/
[**] https://certification.canonical.com/hardware/201711-25989/submission/172650/

Stephen Carr (truck-adel) wrote :

I have the same problem - see attached.

Linux Lenovo-ideapad-520 6.2.0-33-generic #33~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep 7 10:33:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Oct 02 08:11:31 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is about to suspend
Oct 02 08:11:44 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is resuming
Oct 02 08:41:45 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is about to suspend
Oct 02 08:41:57 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is resuming
Oct 02 09:11:58 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is about to suspend
Oct 02 09:12:12 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is resuming
Oct 02 09:42:12 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is about to suspend
Oct 02 09:42:26 Lenovo-ideapad-520 ModemManager[1667]: <info> [sleep-monitor-systemd] system is resuming

Stephen Carr (truck-adel) wrote :

I have discovered that the bug prevents Ubuntu 22.04 from suspending to the S3 (deep) state. Setting the suspend state to s2idle works.
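For anyone wanting to check which suspend variant is selected: the kernel lists the available modes in /sys/power/mem_sleep and puts the active one in square brackets. Below is a minimal sketch for reading it; the helper name `parse_mem_sleep` is hypothetical, and the sample string in the docstring is illustrative, not taken from this machine.

```python
from pathlib import Path

MEM_SLEEP = Path("/sys/power/mem_sleep")


def parse_mem_sleep(text):
    """Split a mem_sleep string such as 's2idle [deep]' into (available, active)."""
    modes = text.split()
    # The kernel marks the currently selected mode with square brackets.
    active = next((m.strip("[]") for m in modes if m.startswith("[")), None)
    available = [m.strip("[]") for m in modes]
    return available, active


if __name__ == "__main__":
    try:
        available, active = parse_mem_sleep(MEM_SLEEP.read_text())
        print("available:", available, "active:", active)
    except OSError as exc:
        # The file is absent on kernels or platforms without suspend-to-RAM support.
        print("could not read", MEM_SLEEP, "-", exc)
```

Writing a mode name such as s2idle to the same file as root selects it until the next boot, and the `mem_sleep_default=` kernel parameter sets the default at boot time.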

K W (djstrong) wrote :

I have updated Ubuntu on my Lenovo Y520 from 18.04 to 23.10 (fresh install) and I have problems with suspending (freeze on suspend or wake up). I think it may be related to Nvidia.
```
[160828.853288] PM: suspend entry (deep)
[160828.854252] Filesystems sync: 0.000 seconds
[160828.882275] rfkill: input handler enabled
[160829.005936] Freezing user space processes
[160829.008956] Freezing user space processes completed (elapsed 0.003 seconds)
[160829.008962] OOM killer disabled.
[160829.008963] Freezing remaining freezable tasks
[160829.010721] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[160829.010770] printk: Suspending console(s) (use no_console_suspend to debug)
[160829.024762] wlp3s0: deauthenticating from 6c:5a:b0:c9:b8:2f by local choice (Reason: 3=DEAUTH_LEAVING)
[160829.043993] sd 2:0:0:0: [sda] Synchronizing SCSI cache
[160829.044204] sd 2:0:0:0: [sda] Stopping disk
[160829.996520] ACPI: EC: interrupt blocked
[160830.119907] ACPI: PM: Preparing to enter system sleep state S3
[160830.120226] ACPI: EC: event blocked
[160830.120227] ACPI: EC: EC stopped
[160830.120227] ACPI: PM: Saving platform NVS memory
[160830.120464] Disabling non-boot CPUs ...
[160830.121807] smpboot: CPU 1 is now offline
[160830.123807] smpboot: CPU 2 is now offline
[160830.125574] smpboot: CPU 3 is now offline
[160830.127657] smpboot: CPU 4 is now offline
[160830.129522] smpboot: CPU 5 is now offline
[160830.131306] smpboot: CPU 6 is now offline
[160830.133177] smpboot: CPU 7 is now offline
[148701.924908] [Firmware Bug]: TSC ADJUST differs: CPU0 0 --> -500554061. Restoring
[160830.137174] ACPI: PM: Low-level resume complete
[160830.137234] ACPI: EC: EC started
[160830.137234] ACPI: PM: Restoring platform NVS memory
[160830.138184] Enabling non-boot CPUs ...
[160830.138212] smpboot: Booting Node 0 Processor 1 APIC 0x2
[160830.141748] CPU1 is up
[160830.141767] smpboot: Booting Node 0 Processor 2 APIC 0x4
[160830.144344] CPU2 is up
[160830.144360] smpboot: Booting Node 0 Processor 3 APIC 0x6
[160830.147856] CPU3 is up
[160830.147871] smpboot: Booting Node 0 Processor 4 APIC 0x1
[160830.148759] CPU4 is up
[160830.148775] smpboot: Booting Node 0 Processor 5 APIC 0x3
[160830.149453] CPU5 is up
[160830.149469] smpboot: Booting Node 0 Processor 6 APIC 0x5
[160830.150163] CPU6 is up
[160830.150179] smpboot: Booting Node 0 Processor 7 APIC 0x7
[160830.150898] CPU7 is up
[160830.154911] ACPI: PM: Waking up from system sleep state S3
[160830.160550] ACPI: EC: interrupt unblocked
[160831.452670] nvidia 0000:01:00.0: Enabling HDA controller
[160831.454493] ACPI: EC: event unblocked
[160831.454557] rtlwifi: rtlwifi: wireless switch is on
[160831.464924] i915 0000:00:02.0: [drm] [ENCODER:94:DDI A/PHY A] is disabled/in DSI mode with an ungated DDI clock, gate it
[160831.464956] i915 0000:00:02.0: [drm] [ENCODER:102:DDI B/PHY B] is disabled/in DSI mode with an ungated DDI clock, gate it
[160831.464988] i915 0000:00:02.0: [drm] [ENCODER:111:DDI C/PHY C] is disabled/in DSI mode with an ungated DDI clock, gate it
[160831.479757] nvme nvme0: Shutdown timeout set to 10 seconds
[160831.480874] nvme nvme0: 8/0/0 default/read/poll queues
[160831.652005] r8169 00...
```
