System76 Oryx Pro (oryp5) with 5.0.0-21: Fail to resume from suspend

Bug #1836630 reported by Jeremy Soller on 2019-07-15
This bug affects 4 people
Affects: Linux. Status: Fix Released. Importance: Unknown.
Affects: linux (Ubuntu). Importance: Undecided. Assigned to: Unassigned.
Affects: nvidia-graphics-drivers-430 (Ubuntu). Importance: Undecided. Assigned to: Alberto Milone.

Bug Description

After upgrading the Ubuntu kernel to version 5.0.0-21, the System76 Oryx Pro (oryp5) fails to resume from suspend when using discrete NVIDIA graphics.

The issue can be reproduced on this hardware by following these steps:
- Install Ubuntu 18.04.2
- Add the proposed updates: https://wiki.ubuntu.com/Testing/EnableProposed
- Upgrade:
  sudo apt-get update
  sudo apt-get dist-upgrade
- Install 5.0 HWE kernel:
  sudo apt-get install linux-generic-hwe-18.04-edge
- Install NVIDIA driver:
  sudo apt-get install nvidia-driver-430
- Reboot:
  sudo reboot
- Attempt suspend/resume cycle

This occurred after upgrading the kernel from version 5.0.0-20.
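
The final reproduction step (attempting a suspend/resume cycle) can be made repeatable with a small script. This is only a sketch: the helper names `nvidia_loaded` and `report_setup` are illustrative and do not come from the report, and the 30-second wake timer is an arbitrary choice. It assumes `rtcwake` from util-linux is installed.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of the bug report.

# Succeed if the "nvidia" kernel module is listed in the given modules
# file (defaults to /proc/modules). The trailing space in the pattern
# avoids matching nvidia_modeset and similar modules.
nvidia_loaded() {
    grep -q '^nvidia ' "${1:-/proc/modules}"
}

# Print the running kernel release and whether nvidia is loaded, so the
# test conditions are recorded before each suspend attempt.
report_setup() {
    echo "kernel: $(uname -r)"
    if nvidia_loaded "$@"; then
        echo "nvidia: loaded"
    else
        echo "nvidia: not loaded"
    fi
}

# To attempt a timed suspend/resume cycle (the RTC wakes the machine
# after 30 seconds), run:
#   report_setup && sudo rtcwake -m mem -s 30
```

If the machine does not come back after the timer expires, the resume failure described above has been reproduced.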
---
ProblemType: Bug
ApportVersion: 2.20.9-0ubuntu7.7
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: system76 1721 F.... pulseaudio
 /dev/snd/controlC0: system76 1721 F.... pulseaudio
CurrentDesktop: ubuntu:GNOME
DistroRelease: Ubuntu 18.04
MachineType: System76 Oryx Pro
NonfreeKernelModules: nvidia_modeset nvidia
Package: linux-hwe-edge
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.0.0-21-generic root=UUID=10b5d457-8884-4b50-bd82-9b38e7f36564 ro
ProcVersionSignature: Ubuntu 5.0.0-21.22~18.04.1-generic 5.0.15
RelatedPackageVersions:
 linux-restricted-modules-5.0.0-21-generic N/A
 linux-backports-modules-5.0.0-21-generic N/A
 linux-firmware 1.173.9
Tags: bionic
Uname: Linux 5.0.0-21-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm sudo
_MarkForUpload: True
dmi.bios.date: 05/07/2019
dmi.bios.vendor: INSYDE Corp.
dmi.bios.version: 1.07.08
dmi.board.asset.tag: Tag 12345
dmi.board.name: Oryx Pro
dmi.board.vendor: System76
dmi.board.version: oryp5
dmi.chassis.asset.tag: No Asset Tag
dmi.chassis.type: 10
dmi.chassis.vendor: System76
dmi.chassis.version: oryp5
dmi.modalias: dmi:bvnINSYDECorp.:bvr1.07.08:bd05/07/2019:svnSystem76:pnOryxPro:pvroryp5:rvnSystem76:rnOryxPro:rvroryp5:cvnSystem76:ct10:cvroryp5:
dmi.product.family: Not Applicable
dmi.product.name: Oryx Pro
dmi.product.sku: Not Applicable
dmi.product.version: oryp5
dmi.sys.vendor: System76

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1836630

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected bionic
description: updated

Changed in linux-hwe-edge (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Jeremy Soller (jackpot51) wrote :

I was able to narrow this down to this commit:

commit 1bde9ecf018ee646b68258921bf0fa364afda38a
Author: Keith Busch <email address hidden>
Date: Thu May 23 09:27:35 2019 -0600

    nvme-pci: Use host managed power state for suspend

    BugLink: https://bugs.launchpad.net/bugs/1808957

    The nvme pci driver prepares its devices for power loss during suspend
    by shutting down the controllers. The power setting is deferred to
    pci driver's power management before the platform removes power. The
    suspend-to-idle mode, however, does not remove power.

    NVMe devices that implement host managed power settings can achieve
    lower power and better transition latencies than using generic PCI power
    settings. Try to use this feature if the platform is not involved with
    the suspend. If successful, restore the previous power state on resume.

    Cc: Mario Limonciello <email address hidden>
    Cc: Kai Heng Feng <email address hidden>
    Tested-by: Kai-Heng Feng <email address hidden>
    Tested-by: Mario Limonciello <email address hidden>
    Reviewed-by: Rafael J. Wysocki <email address hidden>
    Reviewed-by: Christoph Hellwig <email address hidden>
    Signed-off-by: Keith Busch <email address hidden>
    Signed-off-by: Sagi Grimberg <email address hidden>
    (cherry picked from commit a0805317252ad9cf09d4a32b0435e165580adf8a)
    Signed-off-by: Kai-Heng Feng <email address hidden>
    Acked-by: Timo Aaltonen <email address hidden>
    Acked-by: Stefan Bader <email address hidden>
    Signed-off-by: Kleber Sacilotto de Souza <email address hidden>

After the application of "nvme-pci: Use host managed power state for suspend" in d916b1be94b6dc8d293abed2451f3062f6af7551, the System76 Oryx Pro (oryp5) has trouble resuming from suspend.

Also reported in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1836630

Jeremy Soller (jackpot51) wrote :

It looks like reverting 1bde9ecf018ee646b68258921bf0fa364afda38a fixes suspend issues when writing "devices" to /sys/power/pm_test, while additionally reverting 6c6bc1aee6a61fe7a98196794735849286761a4a fixes suspend issues when using standard suspend/resume:

commit 6c6bc1aee6a61fe7a98196794735849286761a4a
Author: Rafael J. Wysocki <email address hidden>
Date: Thu Jun 13 23:59:45 2019 +0200

    PCI: PM: Skip devices in D0 for suspend-to-idle

    BugLink: https://bugs.launchpad.net/bugs/1808957

    Patchwork: https://patchwork.kernel.org/patch/10993697/

    Commit d491f2b75237 ("PCI: PM: Avoid possible suspend-to-idle issue")
    attempted to avoid a problem with devices whose drivers want them to
    stay in D0 over suspend-to-idle and resume, but it did not go as far
    as it should with that.

    Namely, first of all, the power state of a PCI bridge with a
    downstream device in D0 must be D0 (based on the PCI PM spec r1.2,
    sec 6, table 6-1, if the bridge is not in D0, there can be no PCI
    transactions on its secondary bus), but that is not actively enforced
    during system-wide PM transitions, so use the skip_bus_pm flag
    introduced by commit d491f2b75237 for that.

    Second, the configuration of devices left in D0 (whatever the reason)
    during suspend-to-idle need not be changed and attempting to put them
    into D0 again by force is pointless, so explicitly avoid doing that.

    Fixes: d491f2b75237 ("PCI: PM: Avoid possible suspend-to-idle issue")
    Reported-by: Kai-Heng Feng <email address hidden>
    Signed-off-by: Rafael J. Wysocki <email address hidden>
    Reviewed-by: Mika Westerberg <email address hidden>
    Tested-by: Kai-Heng Feng <email address hidden>
    (cherry picked from commit 3e26c5feed2add218046ecf91bab3cfa9bf762a6)
    Signed-off-by: Kai-Heng Feng <email address hidden>
    Acked-by: Timo Aaltonen <email address hidden>
    Acked-by: Stefan Bader <email address hidden>
    Signed-off-by: Kleber Sacilotto de Souza <email address hidden>
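
For readers unfamiliar with the /sys/power/pm_test check mentioned in this comment, the following is a rough sketch of how such a test is usually driven. It assumes a kernel built with CONFIG_PM_DEBUG; the helper name `pm_test_current` is illustrative and not taken from the thread.

```shell
#!/bin/sh
# Sketch only: pm_test_current is a hypothetical helper name.

# Print the currently selected pm_test level. The kernel lists all levels
# on one line and marks the active one with brackets, for example:
#   none core processors platform [devices] freezer
pm_test_current() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "${1:-/sys/power/pm_test}"
}

# Typical use, as root:
#   echo devices > /sys/power/pm_test   # stop the suspend after device callbacks
#   echo mem > /sys/power/state         # start the test suspend; it resumes by itself
#   pm_test_current                     # confirm which level is selected
#   echo none > /sys/power/pm_test      # restore normal suspend behaviour
```

With the "devices" level selected, the kernel suspends devices, waits briefly, and resumes without involving the platform firmware, which helps separate driver problems from platform problems.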

Jeremy Soller (jackpot51) wrote :

I have tested a kernel with _only_ 6c6bc1aee6a61fe7a98196794735849286761a4a "PCI: PM: Skip devices in D0 for suspend-to-idle" reverted. That appears to fix all issues with suspend on this system. I am now attempting to find a minimal patch for restoring functionality.

I found that the real commit that needs to be reverted is 3e26c5feed2add218046ecf91bab3cfa9bf762a6 from https://patchwork.kernel.org/patch/10993697/, meaning that the 5.2 kernel is also affected.

I am working to identify a minimal changeset to restore functionality on this system.

Kai-Heng Feng (kaihengfeng) wrote :

Jeremy,

1) Does the system default to S2I or S3?
2) Does the issue happen when nvidia.ko is not loaded?

Jeremy Soller (jackpot51) wrote :

1) S3
2) No, it happens only when the nvidia driver is active
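
For anyone else answering question 1) on their own machine, the default can be read from /sys/power/mem_sleep, where the kernel puts the active mode in brackets. A minimal sketch follows; the helper name `mem_sleep_default` is an assumption for illustration, not something used in this thread.

```shell
#!/bin/sh
# Sketch only: mem_sleep_default is a hypothetical helper name.

# Read the bracketed (default) entry of mem_sleep and map it to the
# labels used in this thread: "deep" is S3, "s2idle" is suspend-to-idle.
mem_sleep_default() {
    mode=$(sed -n 's/.*\[\([^]]*\)].*/\1/p' "${1:-/sys/power/mem_sleep}")
    case "$mode" in
        deep)   echo "S3" ;;
        s2idle) echo "S2I" ;;
        *)      echo "other: $mode" ;;
    esac
}

# Usage on a running system:
#   mem_sleep_default
```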

Kai-Heng Feng (kaihengfeng) wrote :

As of now, the PCI core skips pci_prepare_to_sleep(), which puts the device into D3, if pci_save_state() was called by the driver's suspend callback.

IMO there are two ways to handle this:
1) Don't call pci_save_state().
2) Manually put device to D3 after pci_save_state().

Let's try 1) first:

diff --git a/nv.c b/nv.c
index b6dc6f3..ed250c8 100644
--- a/nv.c
+++ b/nv.c
@@ -4200,8 +4200,6 @@ nv_power_management(
             nv_kthread_q_stop(&nvl->bottom_half_q);

             nv_disable_pat_support();
-
-            pci_save_state(nvl->pci_dev);
             break;
         }
         case NV_PM_ACTION_RESUME:

Jeremy Soller (jackpot51) wrote :

Thanks, how should I apply this? To the /usr/src/nvidia-*/nvidia/nv.c file?

Jeremy Soller (jackpot51) wrote :

I will build the nvidia DKMS module with this patch and test it out, thanks.

Jeremy Soller (jackpot51) wrote :

Kai-Heng,

That patch when applied to the nvidia DKMS module appears to fix the suspend issues.

Kai-Heng Feng (kaihengfeng) wrote :

Alberto, I think we should raise the issue to Nvidia.

For a short term solution, is it possible to apply the change (comment #23) to all supported nvidia package versions in the interim?

no longer affects: linux-hwe-edge (Ubuntu)
Changed in linux (Ubuntu):
status: Confirmed → Invalid
Jeremy Soller (jackpot51) wrote :

I wonder if removing that line would cause issues on kernels that predate "PCI: PM: Skip devices in D0 for suspend-to-idle", like the bionic kernel?

Kai-Heng Feng (kaihengfeng) wrote :

The PCI core saves the device's config space during suspend, so I think it won't cause any issues.

Alberto Milone (albertomilone) wrote :

I can apply the change depending on the availability of the new nvme_* functions, so that kernels which don't include the two commits will not be affected.

I am going to notify NVIDIA too.

Alberto Milone (albertomilone) wrote :

Actually, the calls are not exposed, so, maybe we should rely on the kernel version instead.
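
A kernel-version gate of the kind suggested here could look roughly like the sketch below. The helper name `kernel_at_least` and the 5.0.0 threshold are placeholders chosen for illustration; the thread does not state which version would be used. It assumes a `sort` that supports `-V` (GNU coreutils does).

```shell
#!/bin/sh
# Sketch only: kernel_at_least and the threshold below are hypothetical.

# Succeed if the upstream part of a kernel release string (for example
# "5.0.0-21-generic") is at least the given minimum version.
kernel_at_least() {
    ver=${1%%-*}   # drop the ABI/flavour suffix, e.g. "-21-generic"
    min=$2
    # With version sort, the minimum comes first (or ties) when ver >= min.
    [ "$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n 1)" = "$min" ]
}

# Usage on a running system (placeholder threshold):
#   kernel_at_least "$(uname -r)" 5.0.0 && echo "apply the patch"
```

Note that a check on the upstream version alone cannot tell 5.0.0-20 from 5.0.0-21, so it would not distinguish kernels with and without the backported commits.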

Jeremy Soller (jackpot51) wrote :

Has there been any response from NVIDIA?

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nvidia-graphics-drivers (Ubuntu):
status: New → Confirmed
Chris (firstcaptain25) wrote :

I'm having the same issue. Is there a workaround while it gets fixed?

Kai-Heng Feng (kaihengfeng) wrote :

Alberto,

I think the change can be applied universally. The PCI core will call pci_save_state() if the driver doesn't do it.

Changed in linux:
status: Unknown → Confirmed
Changed in nvidia-graphics-drivers (Ubuntu):
assignee: nobody → Alberto Milone (albertomilone)

Hi Jeremy,
I've looked at the post at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1836630,
and it seems that it is an NVIDIA DKMS module issue which has already been fixed.

Changed in linux:
status: Confirmed → Fix Released
Alberto Milone (albertomilone) wrote :

Can you try the 430 driver from this PPA, please? (I have uploaded a fix for 18.04 and for 19.10)

Gregor Darius (gregordarius) wrote :

I have been testing both versions today on two notebooks. I will be done with testing tomorrow and inform you about the results.

Thank you!

Gregor Darius (gregordarius) wrote :

The test results:

Model: PB71RD
board_name: PB50_70RF,RD,RC
- Ubuntu 18.04.3
suspend and wake up works, no other problems found

- Ubuntu 19.10
not possible to test, as the system is very unstable even without the new driver

Model: P970RC
board_name: P9XXRC
- Ubuntu 18.04.3
suspend and wake up works, no other problems found

- Ubuntu 19.10
suspend and wake up works, no other problems found

Both notebooks previously froze on wakeup after suspend. They now work with the new driver.

Do you need any logfiles or any other kind of tests on those systems?

Wojciech Wieckowski (xplwowi) wrote :

The patch "do-not-call-pci_save_state.patch", which removes the line "pci_save_state(nvl->pci_dev);", causes the error 'gpu has fallen off the bus' after resuming from suspend.

DistroRelease: Ubuntu 19.10
Uname: Linux 5.3.0-18-generic x86_64

dmi.bios.date: 05/30/2018
dmi.bios.vendor: LENOVO
dmi.bios.version: GNET88WW (2.36 )
dmi.board.asset.tag: Not Available
dmi.board.name: 20BG0016US
dmi.board.vendor: LENOVO
dmi.board.version: 0B98401 Pro
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvrGNET88WW(2.36):bd05/30/2018:svnLENOVO:pn20BG0016US:pvrThinkPadW540:rvnLENOVO:rn20BG0016US:rvr0B98401Pro:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.family: ThinkPad W540
dmi.product.name: 20BG0016US
dmi.product.sku: LENOVO_MT_20BG
dmi.product.version: ThinkPad W540
dmi.sys.vendor: LENOVO

Wojciech Wieckowski (xplwowi) wrote :
affects: nvidia-graphics-drivers (Ubuntu) → nvidia-graphics-drivers-430 (Ubuntu)
Alberto Milone (albertomilone) wrote :

@Wojciech: which nvidia package (and version) are you using?

Wojciech Wieckowski (xplwowi) wrote :

435.21-0ubuntu2 from official eoan repository.
When this patch is removed from dkms.conf, the driver is recompiled, and the initramfs is updated, suspend/resume works as expected.

Alberto Milone (albertomilone) wrote :

I wonder if your system is doing S2Idle (suspend to idle) instead of S3. Either way, we should find a solution that works in both cases. The kernel will be the right place for this.

Wojciech Wieckowski (xplwowi) wrote :

The attached dmesg log shows an S3 suspend and resume cycle on my machine with pci_save_state() enabled. From my point of view it looks pretty normal and fully supported.

Kai-Heng Feng (kaihengfeng) wrote :

Calling pci_save_state() makes the NVIDIA PCI devices stay in D0 under S3. We need to put them into D3 to save power.
