System does not reliably come out of suspend

Bug #1803179 reported by Mathieu Trudel-Lapierre
This bug affects 16 people
Affects                               Status      Importance  Assigned to  Milestone
Linux                                 Incomplete  Medium
linux (Ubuntu)                        Confirmed   Medium      Unassigned
nvidia-graphics-drivers-390 (Ubuntu)  Confirmed   Undecided   Unassigned
nvidia-graphics-drivers-410 (Ubuntu)  Confirmed   Undecided   Unassigned

Bug Description

Dell XPS 15 (9570): the system may eventually manage to suspend when the lid is closed, but more often than not it does not wake up again when the lid is opened. Waking it with the power button often results in a system that is apparently frozen (the display shows the last frame from before suspend, and the clock seconds do not change).

System is unresponsive to the keyboard at that time (can't switch to a VT or otherwise interact with the system other than holding the power button for a few seconds to shut it down).

ProblemType: Bug
DistroRelease: Ubuntu 19.04
Package: linux-image-4.18.0-10-generic 4.18.0-10.11
ProcVersionSignature: Ubuntu 4.18.0-10.11-generic 4.18.12
Uname: Linux 4.18.0-10-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.10-0ubuntu14
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: mtrudel 2516 F.... pulseaudio
CurrentDesktop: ubuntu:GNOME
Date: Tue Nov 13 10:42:08 2018
InstallationDate: Installed on 2018-11-02 (10 days ago)
InstallationMedia: Ubuntu 18.10 "Cosmic Cuttlefish" - Release amd64 (20181017.3)
MachineType: Dell Inc. XPS 15 9570
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.18.0-10-generic root=UUID=14900847-323c-4427-b59e-89210ec1c8ec ro quiet splash vt.handoff=1
RelatedPackageVersions:
 linux-restricted-modules-4.18.0-10-generic N/A
 linux-backports-modules-4.18.0-10-generic N/A
 linux-firmware 1.175
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 09/03/2018
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.5.0
dmi.board.name: 0D0T05
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 10
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.5.0:bd09/03/2018:svnDellInc.:pnXPS159570:pvr:rvnDellInc.:rn0D0T05:rvrA00:cvnDellInc.:ct10:cvr:
dmi.product.family: XPS
dmi.product.name: XPS 15 9570
dmi.product.sku: 087C
dmi.sys.vendor: Dell Inc.

Revision history for this message
In , peter (peter-linux-kernel-bugs) wrote :

Created attachment 232611
dmesg for v4.7-rc5 (triggered runtime-resume via writing "on" to (nvidia device)/power/control)

See also https://www.spinics.net/lists/linux-pci/msg53694.html ("Kernel Freeze with American Megatrends BIOS") for more details (acpidump, lspci, some analysis, etc.).

Steps to reproduce on the affected machines:

 1. Load nouveau.
 2. Wait for it to runtime suspend.
 3. Invoke 'lspci'; this resumes the Nvidia PCI device via nouveau. (Alternatively, write "on" to /sys/bus/pci/devices/0000:01:00.0/power/control.)
 4. lspci never returns; a few moments later an AML_INFINITE_LOOP is reported.
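
For checking whether the device has actually runtime-suspended before resuming it, here is a minimal Python sketch (not from the original report). It reads the standard sysfs runtime PM attributes `power/control` and `power/runtime_status`. The helper name `read_runtime_pm` is made up for illustration, and the default PCI address is simply the one used in the steps above; adjust it for your machine.

```python
from pathlib import Path

# Assumption: the GPU sits at the address used in the steps above.
DEVICE_DIR = Path("/sys/bus/pci/devices/0000:01:00.0")


def read_runtime_pm(device_dir: Path = DEVICE_DIR) -> dict:
    """Return the 'control' and 'runtime_status' attributes of a PCI device.

    Only files under <device>/power/ are read; attributes that are
    missing or unreadable are reported as None.
    """
    state = {}
    for name in ("control", "runtime_status"):
        attr = device_dir / "power" / name
        try:
            state[name] = attr.read_text().strip()
        except OSError:
            state[name] = None
    return state


if __name__ == "__main__":
    print(read_runtime_pm())
```

On an affected machine one would expect `runtime_status` to read `suspended` once nouveau has powered the card down, which is the state from which the lspci hang is triggered.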

Affected machines from
https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-234494238
- Clevo P651RA (and other Clevo P6xxRx models).
- MSI GE62 Apache Pro
- Gigabyte P35V5
- Razer Blade 14" (2016)
- Dell Inspiron 7559

These *new* laptops all have a Skylake CPU (i7-6700HQ) and an Nvidia GTX 9xxM GPU. Originally this was only observed on laptops with AMI BIOSes, but later we found a Dell laptop as well. The workaround acpi_osi="!Windows 2015" prevents Linux from reporting Windows 10 compatibility and helps *in some cases* because the ACPI code falls back to a different approach to power on the device (or PCIe link?).

Attached is one of the more interesting dmesg dumps that could be obtained; it shows how the system breaks down over time. (This was v4.7-rc5 with PCI/PM D3cold + nouveau power resource/PM refcount leaks patches, but the problem was also visible on an unpatched 4.4.0, for example.)

In , rui.zhang (rui.zhang-linux-kernel-bugs) wrote :

let's focus on one platform first.
People who encounter this problem and can respond quickly, please attach the acpidump of the platform.

In , rui.zhang (rui.zhang-linux-kernel-bugs) wrote :

Okay, let's focus on Clevo_P651RA first.

In , rui.zhang (rui.zhang-linux-kernel-bugs) wrote :

I don't see how to download the acpidump file at https://github.com/Lekensteyn/acpi-stuff/blob/master/dsl/Clevo_P651RA/acpidump.txt
can you please attach it here?

In , peter (peter-linux-kernel-bugs) wrote :

Created attachment 233091
acpidump for Clevo P651RA (BIOS 1.05.07)

You can download the file via the "Raw" link on Github. I have attached a copy of the acpidump.

Of interest is the \_SB.PCI0.PGON method. See also this extract:
https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt#L94

In , lenb (lenb-linux-kernel-bugs) wrote :

Does this still fail if you use the proprietary nvidia driver?

In , lv.zheng (lv.zheng-linux-kernel-bugs) wrote :

Peter:
Could you first try this: attachment 239241?

Rui:
Do you have a PCI contact? Can we have them look at the issue first?
From this link:
https://www.spinics.net/lists/linux-pci/msg53694.html
it looks like a PCI power management gap if attachment 239241 doesn't help.

Thanks
Lv

In , peter (peter-linux-kernel-bugs) wrote :

(In reply to Len Brown from comment #5)
> Does this still fail if you use the proprietary nvidia driver?

I have not tried the proprietary driver, but AFAIK the blob makes no attempt to put the device into the D3 state.

(In reply to Lv Zheng from comment #6)
> Peter:
> Should you first try this: attachment 239241 [details]

I can try, but would it really help? Not all firmware has this loop; firmware without it just assumes that the link state is correct. This is the affected loop:

    While ((\_SB.PCI0.PEG0.LNKS < 0x07)) {
        Local0 = 0x20
        While (Local0) {
            If ((\_SB.PCI0.PEG0.LNKS < 0x07)) {
                Stall (0x64)
                Local0--
            } Else {
                Break
            }
        }

        If ((Local0 == Zero)) {
            \_SB.PCI0.PEG0.RTLK = One
            Stall (0x64)
        }
    }

In one trace I observed that the outer loop was executed 29 times, which means a delay of about 29 * (32 * 100us + 100us) = 95.7ms.
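
The arithmetic above can be reproduced with a small Python sketch (an illustration added here, not part of the original comment). It assumes the worst case for each outer pass: the inner loop polls all 0x20 times with a 100 us stall each, followed by one more 100 us stall after the link retrain is requested. The function name is hypothetical.

```python
STALL_US = 0x64          # Stall (0x64): 100 microseconds per stall
INNER_ITERATIONS = 0x20  # Local0 = 0x20: 32 polls per outer pass


def worst_case_delay_us(outer_passes: int) -> int:
    """Total stall time if LNKS stays below 0x07 for the given outer passes."""
    # 32 polling stalls, plus one stall after RTLK is set to One.
    per_pass = INNER_ITERATIONS * STALL_US + STALL_US
    return outer_passes * per_pass


# 29 outer passes, as observed in the trace mentioned above.
print(worst_case_delay_us(29) / 1000, "ms")  # prints: 95.7 ms
```

This counts only the Stall() calls; the time spent evaluating the AML itself (and any debug logging) comes on top of it.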

In , lv.zheng (lv.zheng-linux-kernel-bugs) wrote :

Do you mean it's already long enough (95.7ms) for this case, and waiting longer won't solve the issue?
I don't know; I just want to rule out possible causes of the bug.

I'm not a PCI expert. So let me ask.
From the following AML, RTLK/LNKS belong to a PCI register space:
    OperationRegion (SANV, SystemMemory, 0x5FF9BD98, 0x0135)
    Field (SANV, AnyAcc, Lock, Preserve)
    {
        ASLB, 32,
        IMON, 8,
        IGDS, 8,
        IBTT, 8,
        IPAT, 8,
        IPSC, 8,
        IBIA, 8,
        ISSC, 8,
        IDMS, 8,
        IF1E, 8,
        HVCO, 8,
        GSMI, 8,
        PAVP, 8,
        CADL, 8,
        CSTE, 16,
        NSTE, 16,
        NDID, 8,
        DID1, 32,
        DID2, 32,
        DID3, 32,
        DID4, 32,
        DID5, 32,
        DID6, 32,
        DID7, 32,
        DID8, 32,
        DID9, 32,
        DIDA, 32,
        DIDB, 32,
        DIDC, 32,
        DIDD, 32,
        DIDE, 32,
        DIDF, 32,
        DIDX, 32,
        NXD1, 32,
        NXD2, 32,
        NXD3, 32,
        NXD4, 32,
        NXD5, 32,
        NXD6, 32,
        NXD7, 32,
        NXD8, 32,
        NXDX, 32,
        LIDS, 8,
        KSV0, 32,
        KSV1, 8,
        BRTL, 8,
        ALSE, 8,
        ALAF, 8,
        LLOW, 8,
        LHIH, 8,
        ALFP, 8,
        IMTP, 8,
        EDPV, 8,
        SGMD, 8,
        SGFL, 8,
        SGGP, 8,
        HRE0, 8,
        HRG0, 32,
        HRA0, 8,
        PWE0, 8,
        PWG0, 32,
        PWA0, 8,
        P1GP, 8,
        HRE1, 8,
        HRG1, 32,
        HRA1, 8,
        PWE1, 8,
        PWG1, 32,
        PWA1, 8,
        P2GP, 8,
        HRE2, 8,
        HRG2, 32,
        HRA2, 8,
        PWE2, 8,
        PWG2, 32,
        PWA2, 8,
        DLPW, 16,
        DLHR, 16,
        EECP, 8,
        XBAS, 32, <- XBAS
        GBAS, 16,
        NVGA, 32,
        NVHA, 32,
        AMDA, 32,
        LTRX, 8,
        OBFX, 8,
        LTRY, 8,
        OBFY, 8,
        LTRZ, 8,
        OBFZ, 8,
        SMSL, 16,
        SNSL, 16,
        P0UB, 8,
        P1UB, 8,
        P2UB, 8,
        PCSL, 8,
        PBGE, 8,
        M64B, 64,
        M64L, 64,
        CPEX, 32,
        EEC1, 8,
        EEC2, 8,
        SBN0, 8,
        SBN1, 8,
        SBN2, 8,
        M32B, 32,
        M32L, 32,
        P0WK, 32,
        P1WK, 32,
        P2WK, 32,
        MXD1, 32,
        MXD2, 32,
        MXD3, 32,
        MXD4, 32,
        MXD5, 32,
        MXD6, 32,
        MXD7, 32,
        MXD8, 32,
        PXFD, 8,
        EBAS, 32,
        DGVS, 32,
        DGVB, 32,
        HYSS, 32
    }

        OperationRegion (RPCX, SystemMemory, Add (\XBAS, 0x8000), 0x1000)
        Field (RPCX, ByteAcc, NoLock, Preserve)
        {
            Offset (0x04),
            CMDR, 8,
            Offset (0x84),
            D0ST, 2,
            Offset (0xAA),
            CEDR, ...


In , lv.zheng (lv.zheng-linux-kernel-bugs) wrote :

It looks like the AML code in PGON prior to this loop should always make the condition true. What the platform needs to do is wait.
So IMO, the code prior to this loop is more important for root-causing this issue.

In , peter (peter-linux-kernel-bugs) wrote :

(In reply to Lv Zheng from comment #8)
> Do you mean it's already long enough (95.7ms) for this case, and waiting
> longer won't solve the issue?

That would be the theoretical delay. In practice, I have several seconds of processing due to ACPI debug logging (ACPI_NAMESPACE, ACPI_DB_NAMES). The logs stop after 46 seconds, maybe because I used SysRq+B for a forced reboot (reset).

> I'm not a PCI expert. So let me ask.
> From the following AML, RTLK/LNKS belong to a PCI register space:
> OperationRegion (SANV, SystemMemory, 0x5FF9BD98, 0x0135)
> Field (SANV, AnyAcc, Lock, Preserve)
> {
[snip]
> Can you infer what it is from the above AML?

XBAS is the PCIe MMIO Base Address register. I guessed that "RTLK" means "Retrain Link" (see PCIe spec 7.8.7 Link Control Register) and that "LNKS" means PCIe Link speed. I posted these on:

https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt

(In reply to Lv Zheng from comment #9)
> It looks like AML code in PGON prior than this loop should always make the
> condition true. What the platform need to do is to wait.
> So IMO, the code prior than this loop is more important for root causing
> this issue.

The loop is indeed just a consequence; the root cause lies in the difference between invoking the "LKEN" code (problematic, see line 120 of notes.txt) and the fallback code (see line 141 of notes.txt).

However, I am at a loss as to why it would be so significant. Note that I am no PCI expert either; the notes were based on the PCIe spec, ACPI tables and lots of guesswork.

Do you need more info?

In , lv.zheng (lv.zheng-linux-kernel-bugs) wrote :

Let me re-assign it to Power-management category and reset the assignee to involve more developers.

Thanks
Lv

In , rjw (rjw-linux-kernel-bugs) wrote :

Peter, one question: Why is this not regarded as a nouveau problem?

In , peter (peter-linux-kernel-bugs) wrote :

(In reply to Rafael J. Wysocki from comment #12)
> Peter, one question: Why is this not regarded as a nouveau problem?

Something changed in Windows 10 that made firmware authors write this specific DSDT workaround. If Linux advertises itself as Windows 7 for example, the problematic code is not triggered. (Some laptops also work when advertising "non-Windows 10", such as Windows 8).

It could be a missing piece in the nouveau driver, but exactly how to tackle that is not known. In a minimal module that uses the new PCI port runtime PM ("PR3 support") introduced with v4.8, I could also trigger the lockups.

Are you aware of changes to the policies in Windows 10 that could explain the different methods of putting a device into D3? Timing-wise, or other API changes?

In , rjw (rjw-linux-kernel-bugs) wrote :

On Tuesday, September 27, 2016 09:28:34 AM <email address hidden> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=156341
>
> --- Comment #13 from Peter Wu <email address hidden> ---
> (In reply to Rafael J. Wysocki from comment #12)
> > Peter, one question: Why is this not regarded as a nouveau problem?
>
> Something changed in Windows 10 that made firmware authors write this
> specific
> DSDT workaround. If Linux advertises itself as Windows 7 for example, the
> problematic code is not triggered. (Some laptops also work when advertising
> "non-Windows 10", such as Windows 8).
>
> It could be a missing piece in the nouveau driver, but exactly how to tackle
> that is not known. In a minimal module that uses the new PCI port runtime PM
> ("PR3 support") introduced with v4.8, I could also trigger the lockups.
>
> Are you aware of changes to the policies in Windows 10 that could explain the
> different methods of putting a device into D3? Timing-wise or other APIs
> changes?

Not at the moment, but I'm going to ask around.

In , rjw (rjw-linux-kernel-bugs) wrote :

One difference between Windows 10 and Windows 7 I know about is that Windows 10 supports power management of PCIe ports and I bet the ASL in comment #7 is needed to cope with that.

That PCIe ports PM appears to be different from what we're going to do in 4.8+, though, which may be the source of the problem.

In , peter (peter-linux-kernel-bugs) wrote :

(In reply to Rafael J. Wysocki from comment #15)
> One difference between Windows 10 and Windows 7 I know about is that Windows
> 10 supports power management of PCIe ports and I bet the ASL in comment #7
> is needed to cope with that.
>
> That PCIe ports PM appears to be different from what we're going to do in
> 4.8+, though, which may be the source of the problem.

The invoked ACPI methods (_ON/_OFF on the power resource) match between Linux and Windows 10. From a packet capture with the WinDbg kernel debugger:
https://lekensteyn.nl/files/p651ra-acpi-debug/acpi-evals.txt

Maybe some extra modifications are needed to the PCIe registers? (No idea, just guessing.)

In , samm (samm-linux-kernel-bugs) wrote :

Tested against 4.9-RC2 on Fedora 25 and the problem still exists

In , peter (peter-linux-kernel-bugs) wrote :

(In reply to Rafael J. Wysocki from comment #15)
> That PCIe ports PM appears to be different from what we're going to do in
> 4.8+, though, which may be the source of the problem.

This is not the source of the problem; the issue already existed with older kernels.

The list of affected models keeps growing: there have been reports from additional HP, Dell and Asus laptops. All of these have in common a Skylake CPU (i7-6700HQ) and some NVIDIA GPU (Maxwell cards, GTX 950M/960M/965M/970M, Quadro M1000M). Some of the laptops are listed in the updated list at
https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-234494238

Any idea what to look into? Patches, documentation or other possible hints?

Failing a long-term solution, I am considering a temporary ACPI hack that patches the affected ACPI method to disable the conditional OSYS check:
https://github.com/Bumblebee-Project/bbswitch/issues/134#issuecomment-258117908

In , samm (samm-linux-kernel-bugs) wrote :

Created attachment 243561
attachment-10037-0.html

Auto-reply: I'm out of the office at present and will be back in on the 7th, please contact <email address hidden> if you require a response.

In , rjw (rjw-linux-kernel-bugs) wrote :

(In reply to Peter Wu from comment #18)
> (In reply to Rafael J. Wysocki from comment #15)
> > That PCIe ports PM appears to be different from what we're going to do in
> > 4.8+, though, which may be the source of the problem.
>
> This is not the source of the problem, the issue exists before with older
> kernels.
>
> The list of affected models keeps growing, there have been reports from
> additional HP, Dell and Asus laptops. All of these have in common a Skylake
> CPU (i7-6700HQ) and some NVIDIA GPU (Maxwell cards, GTX 950M/960M/965M/970M,
> Quadro M1000M). Some of the laptops are listed at the updated list in
> https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-
> 234494238
>
> Any idea what to look into? Patches, documentation or other possible hints?

You said that acpi_osi="!Windows 2015" helped in some cases. I guess the other cases (where it doesn't help) are Windows 10 only systems?

In , rjw (rjw-linux-kernel-bugs) wrote :

And what if we simply avoided using ACPI PM with the affected device on those systems?

In , peter (peter-linux-kernel-bugs) wrote :

> You said that acpi_osi="!Windows 2015" helped in some cases. I guess the
> other cases (where it doesn't help) are Windows 10 only systems?

Not sure, I did not check if these systems have support for just w10 (and not 7, 8 or 8.1). Some others require acpi_osi=! acpi_osi="Windows 2009" to avoid the problematic code path in the ACPI table.

(In reply to Rafael J. Wysocki from comment #21)
> And what if we simply avoided using ACPI PM with the affected device on
> those systems?

You mean acpi=off? Avoiding runtime PM in nouveau would be sufficient, but it kills battery life. One interesting observation is that turning off the ACPI power resource (via PCIe port PM) or system sleep does not seem to trigger the issue (compared to using nouveau). Maybe I'm dreaming; I have to retest this just to be sure.

Do you have tips for tracing PCI register activities? (E.g. read/write pm regs)

In , billybrawner (billybrawner-linux-kernel-bugs) wrote :

Hi everyone, I'm hoping to provide some helpful information here. I'm affected by this bug, in that I can't login to gnome unless I either blacklist the nouveau module or add "nouveau.runpm=0" to my kernel parameters. I've got some files here that I hope are of use to you:

Link to laptop: https://www.newegg.com/Product/Product.aspx?Item=N82E16834234412
Link to call trace where I can't login: https://paste.fedoraproject.org/533827/14851039/
Tar archive with system info: http://wbrawner.com/files/ASUSTeK_COMPUTER_INC.-X550VX.tar.gz

For what it's worth, I can't even get to the login screen with the proprietary nVidia drivers. Please let me know if I can otherwise be of assistance.

In , agronick (agronick-linux-kernel-bugs) wrote :

I just wanted to report that this issue is present on Lenovo W541 with 4.9.4-1. You can see my full bug report here for the symptoms: https://bugzilla.opensuse.org/show_bug.cgi?id=1022443

Putting acpi_osi=! acpi_osi="Windows 2009" in the Grub boot line fixed it.

Here are my GPUs:
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
01:00.0 VGA compatible controller: NVIDIA Corporation GK107GLM [Quadro K1100M] (rev a1)

cpuinfo prints:
Vendor ID: GenuineIntel
Hardware Raw:
Brand: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
Hz Advertised: 2.8000 GHz
Hz Actual: 2.8000 GHz
Hz Advertised Raw: (2800000000, 0)
Hz Actual Raw: (2800000000, 0)
Arch: X86_64
Bits: 64
Count: 8
Raw Arch String: x86_64
L2 Cache Size: 6144 KB
L2 Cache Line Size: 0
L2 Cache Associativity: 0
Stepping: 3
Model: 60
Family: 6
Processor Type: 0
Extended Model: 0
Extended Family: 0
Flags: abm, acpi, aes, aperfmperf, apic, arat, arch_perfmon, avx, avx2, bmi1, bmi2, bts, clflush, cmov, constant_tsc, cx16, cx8, de, ds_cpl, dtes64, dtherm, dts, eagerfpu, epb, ept, erms, est, f16c, flexpriority, fma, fpu, fsgsbase, fxsr, ht, ida, invpcid, lahf_lm, lm, mca, mce, mmx, monitor, movbe, msr, mtrr, nonstop_tsc, nopl, nx, pae, pat, pbe, pcid, pclmulqdq, pdcm, pdpe1gb, pebs, pge, pln, pni, popcnt, pse, pse36, pts, rdrand, rdtscp, rep_good, sdbg, sep, smep, smx, ss, sse, sse2, sse4_1, sse4_2, ssse3, syscall, tm, tm2, tpr_shadow, tsc, tsc_adjust, tsc_deadline_timer, vme, vmx, vnmi, vpid, x2apic, xsave, xsaveopt, xtopology, xtpr

In , gbloisi (gbloisi-linux-kernel-bugs) wrote :

This issue is present on ASUS n552vw-fi056t (Core i7-6700HQ and NVIDIA GeForce GTX 960M) with kernel 4.9.9.

Putting acpi_osi=! acpi_osi="Windows 2009" in the Grub boot line worked around the problem, even though some functionality is lost (screen dimming shortcuts).

In , a.pronobis (a.pronobis-linux-kernel-bugs) wrote :

The issue is also present on a Kaby Lake Dell XPS15 9560 with an i7-7700HQ and an NVidia GTX1050M. It manifests as complete freezes if the Intel card is used for X and the NVidia card is disabled with bumblebee. Running nvidia-smi or lspci then causes a freeze. The freezes do not happen if the NVidia card is enabled using bbswitch.

Some info from lspci:

00:02.0 VGA compatible controller: Intel Corporation Device 591b (rev 04)
01:00.0 3D controller: NVIDIA Corporation Device 1c8d (rev a1)

In , a.pronobis (a.pronobis-linux-kernel-bugs) wrote :

I would like to add that acpi_osi="!Windows 2015" does not solve the problem, while acpi_osi=! acpi_osi="Windows 2009" does (it does disable the touchpad though).

In , rui.zhang (rui.zhang-linux-kernel-bugs) wrote :

@Andrzej, please attach the acpidump output of your laptop.

In , a.pronobis (a.pronobis-linux-kernel-bugs) wrote :

Created attachment 254763
acpidump for Dell XPS15 9560 KabyLake i7-7700HQ/GTX1050M BIOS 1.0.3

Here comes the acpidump for my system: Dell XPS15 9560 KabyLake i7-7700HQ/GTX1050M BIOS 1.0.3

In , gbloisi (gbloisi-linux-kernel-bugs) wrote :

Created attachment 254777
acpidump for ASUS N552VW-FI056T SkyLake i7-6700HQ/GTX 960M BIOS 3.0.0

In , a.pronobis (a.pronobis-linux-kernel-bugs) wrote :

Is there anything else I can do to help debug this issue?

In , tobe.schumacher (tobe.schumacher-linux-kernel-bugs) wrote :

Created attachment 254843
Patch for XPS 9560

I am facing the same problem on an XPS 9560 and had a look at the acpidump. Here the corresponding check is as follows:

If ((OSYS <= 0x07D9) || ((OSYS == 0x07DF) && (_REV == 0x05)))

So, telling the BIOS that we support ACPI Rev. 5 should be sufficient for this model to allow powering down the Nvidia without locking up. There is already some code which does this for other XPS and Latitude models in drivers/acpi/blacklist.c, I extended it for the XPS 9560. I also sent the patch to the LKML.
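
To make the check easier to reason about, here is a small Python model of that condition (an illustrative sketch, not the patch itself). The constant names are made up; the mapping of 0x07D9 to "Windows 2009" and 0x07DF to "Windows 2015" follows the usual DSDT convention of storing the year in OSYS. The _REV values used below assume that Linux normally reports 2 and reports 5 when the revision override is active.

```python
OSYS_WINDOWS_2009 = 0x07D9  # 2009, i.e. the "Windows 2009" _OSI string
OSYS_WINDOWS_2015 = 0x07DF  # 2015, i.e. the "Windows 2015" _OSI string


def condition_holds(osys: int, rev: int) -> bool:
    """Mirror of: (OSYS <= 0x07D9) || ((OSYS == 0x07DF) && (_REV == 0x05))."""
    return osys <= OSYS_WINDOWS_2009 or (osys == OSYS_WINDOWS_2015 and rev == 5)


if __name__ == "__main__":
    cases = [
        ("Windows 2015, _REV 2", OSYS_WINDOWS_2015, 2),
        ("Windows 2015, _REV 5", OSYS_WINDOWS_2015, 5),
        ("Windows 2009, _REV 2", OSYS_WINDOWS_2009, 2),
    ]
    for label, osys, rev in cases:
        print(f"{label}: condition is {condition_holds(osys, rev)}")
```

Under these assumptions only the first case fails the check, which matches the observation that either overriding _REV to 5 or falling back to "Windows 2009" avoids the problematic path on this model.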

Mathieu Trudel-Lapierre (cyphermox) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.20 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.20-rc2

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key
Mathieu Trudel-Lapierre (cyphermox) wrote :

Seems like this is caused by the nvidia drivers not liking s2idle; forcing the use of 'deep' leads to reliable suspend/resume cycles. Properly removing the nvidia drivers also leads to a working system.
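
To see which suspend variant a system currently uses, one can read /sys/power/mem_sleep, where the kernel lists the available modes and marks the active one with square brackets (for example `s2idle [deep]`). The following Python sketch parses that format; the function name is hypothetical and the snippet is only an illustration, not something taken from this report.

```python
from pathlib import Path


def parse_mem_sleep(text: str):
    """Split e.g. 's2idle [deep]' into (active mode, list of offered modes)."""
    active = None
    modes = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # strip the brackets marking the active mode
            active = token
        modes.append(token)
    return active, modes


if __name__ == "__main__":
    try:
        active, modes = parse_mem_sleep(Path("/sys/power/mem_sleep").read_text())
        print(f"active: {active}, available: {modes}")
    except OSError as exc:
        print(f"could not read /sys/power/mem_sleep: {exc}")
```

If `deep` is offered but not active, it can be selected for the running session by writing `deep` to the same file as root, or persistently with the mem_sleep_default=deep kernel parameter.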

snevas (snevas) wrote :

Some added info. This bug is present in:
driver : nvidia-driver-415 - third-party free recommended
driver : nvidia-driver-410 - third-party free
driver : nvidia-driver-390 - distro non-free

GPU info:
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001C8Csv00001028sd0000087Cbc03sc02i00
vendor : NVIDIA Corporation
model : GP107M [GeForce GTX 1050 Ti Mobile]

When in the faulty state `nvidia-smi` displays a 100% GPU Utilization.
Using `prime-select intel` to switch GPU and killing X gives back your display environment using onboard GPU.

snevas (snevas)
no longer affects: linux
snevas (snevas) wrote :
snevas (snevas) wrote :

A workaround is adding the following boot parameters:
`acpi_rev_override=1 acpi_osi=Linux nouveau.modeset=0 pcie_aspm=force drm.vblankoffdelay=1 scsi_mod.use_blk_mq=1 nouveau.runpm=0 mem_sleep_default=deep`

(Thanks to the input of nvidia forum and github.com/JackHack96/dell-xps-9570-ubuntu-respin )
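
To verify that the parameters actually reached the running kernel, the command line can be compared against the list above. This is a minimal Python sketch under a few assumptions: the helper names are invented, the quoting rules of shlex are close enough to how the boot loader passes quoted values such as acpi_osi="Windows 2009", and a parameter given more than once keeps only its last value.

```python
import shlex
from pathlib import Path

# The workaround string quoted in the comment above.
WORKAROUND = (
    "acpi_rev_override=1 acpi_osi=Linux nouveau.modeset=0 pcie_aspm=force "
    "drm.vblankoffdelay=1 scsi_mod.use_blk_mq=1 nouveau.runpm=0 "
    "mem_sleep_default=deep"
)


def parse_cmdline(cmdline: str) -> dict:
    """Map each parameter to its value; bare flags such as 'quiet' map to None."""
    params = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(cmdline: str, wanted: str = WORKAROUND) -> list:
    """Return the wanted key=value pairs that are absent or set differently."""
    have = parse_cmdline(cmdline)
    return [f"{k}={v}" for k, v in parse_cmdline(wanted).items() if have.get(k) != v]


if __name__ == "__main__":
    try:
        print(missing_params(Path("/proc/cmdline").read_text()))
    except OSError as exc:
        print(f"could not read /proc/cmdline: {exc}")
```

An empty list means every parameter from the workaround is present with the same value. Whether all of them are needed on a given machine is a separate question that this check does not answer.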

Changed in linux:
importance: Unknown → Medium
status: Unknown → Incomplete
Changed in nvidia-graphics-drivers-390 (Ubuntu):
status: New → Confirmed
Changed in nvidia-graphics-drivers-410 (Ubuntu):
status: New → Confirmed
description: updated
Brad Figg (brad-figg)
tags: added: cscc
In , russell.kernel (russell.kernel-linux-kernel-bugs) wrote :

After hours of experimenting on this laptop :

Computer : PC Specialist OptimusIX 15 (aka Clevo N8xxEP6)
BIOS : American Megatrends 1.07.13
OS : Arch Linux
GPU : NVIDIA GTX 1060 Mobile

Until recently, any attempt to use bumblebee or acpi commands to power down the GPU has resulted in a system freeze with lspci, suspend, power cable plug in, etc. No kernel command-line parameters seem to have any effect.

I have discovered that the system freeze is closely linked to the interaction between the nvidia graphics card on pci address 0000:01:00.0 and its associated sound card at pci address 0000:01:00.1 (I don't actually know what that sound card is doing - I presume it's for the HDMI port?)

If I completely disable the audio card using :
echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.1/remove

Then the system hangs are completely cured - I can acpi _OFF or _ON or _PS3 or _PS0 to my heart's content and the gfx card will power up and down perfectly, lspci behaves perfectly normally (without any lag), and suspend/resume and power cable plug/unplug all work. Even better, kernel power management on the PCI bus seems to work perfectly too, but only kicks in when I rmmod nvidia. So far, bumblebee and bbswitch also seem to be totally happy.

Can anybody else confirm similar findings?

Bear in mind that the audio card needs to be removed BEFORE the kernel loads any audio modules. I do it like this :

[Unit]
Description=Nvidia Audio Card OnBoot Disabler
Before=bumblebeed.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/sh -c "echo 1 > /sys/bus/pci/devices/0000:01:00.1/remove"
ExecStop=/usr/bin/sh -c "echo 1 > /sys/bus/pci/rescan"

[Install]
WantedBy=sysinit.target
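
The unit above hard-codes 0000:01:00.1, which will differ on machines where the GPU sits at another address. As a hedged illustration (both helper names are made up), this Python sketch derives the audio function's address and its sysfs remove path from the GPU's address, assuming the audio device is function 1 of the same PCI device as on this laptop.

```python
import re

# domain:bus:device.function, e.g. 0000:01:00.0
_PCI_ADDR = re.compile(r"^([0-9a-f]{4}):([0-9a-f]{2}):([0-9a-f]{2})\.([0-7])$")


def sibling_function(address: str, function: int) -> str:
    """Return the address of another function on the same PCI device."""
    match = _PCI_ADDR.match(address.lower())
    if match is None:
        raise ValueError(f"not a PCI address: {address!r}")
    if not 0 <= function <= 7:
        raise ValueError("PCI function numbers range from 0 to 7")
    domain, bus, device, _ = match.groups()
    return f"{domain}:{bus}:{device}.{function}"


def remove_path(address: str) -> str:
    """sysfs file that removes the device when '1' is written to it."""
    return f"/sys/bus/pci/devices/{address}/remove"


if __name__ == "__main__":
    audio = sibling_function("0000:01:00.0", 1)
    print(remove_path(audio))  # /sys/bus/pci/devices/0000:01:00.1/remove
```

The sketch only builds the path; writing to it still requires root and, as noted above, has to happen before the audio driver binds to the device.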

In , mfulz (mfulz-linux-kernel-bugs) wrote :

(In reply to arpie from comment #140)
> After hours of experimenting on this laptop :
>
> Computer : PC Specialist OptimusIX 15 (aka Clevo N8xxEP6)
> BIOS : American Megatrends 1.07.13
> OS : Arch Linux
> GPU : NVIDIA GTX 1060 Mobile
>
> Until recently, any attempt to use bumblebee or acpi commands to power down
> the GPU have resulted in a system freeze with lspci, suspend, power cable
> plug in, etc. No kernel line parameters seem to have any effect.
>
> I have discovered that the system freeze is closely linked to the
> interaction between the nvidia graphics card on pci address 0000:01:00.0 and
> its associated sound card at pci address 0000:01:00.1 (I don't actually
> know what that sound card is doing - I presume it's for the HDMI port?)
>
> If I completely disable the audio card using :
> echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.1/remove
>
> Then the system hangs are completely cured - I can acpi _OFF or _ON or _PS3
> or _PS0 to my hearts content and the gfx card will power up and down
> perfectly, lspci behaves perfectly normally (without any lag), and
> suspend/resume and power cable plug/unplug all works. Even better, kernel
> power management on the PCI bus seems to work perfectly too, but only kicks
> in when I rmmod nvidia. So far, bumblebee and bbswitch also seem to be
> totally happy.
>
> Can anybody else confirm similar findings?
>
> Bear in mind that the audio card needs to be removed BEFORE the kernel loads
> any audio modules. I do it like this :
>
> [Unit]
> Description=Nvidia Audio Card OnBoot Disabler
> Before=bumblebeed.service
>
> [Service]
> Type=oneshot
> RemainAfterExit=yes
> ExecStart=/usr/bin/sh -c "echo 1 > /sys/bus/pci/devices/0000:01:00.1/remove"
> ExecStop=/usr/bin/sh -c "echo 1 > /sys/bus/pci/rescan"
>
> [Install]
> WantedBy=sysinit.target

Not working for me. Still freezing with this.

In , russell.kernel (russell.kernel-linux-kernel-bugs) wrote :

(In reply to Matthias Fulz from comment #141)
> (In reply to arpie from comment #140)
[snip]
> > If I completely disable the audio card using :
> > echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.1/remove
> >
> > Then the system hangs are completely cured
>
> Not working for me. Still freezing with this.

Any chance of more details? When and how is it freezing? Is it any different from before? What are your machine/card details (looks like you haven't posted these anywhere above)?

Also, are you absolutely sure you've disabled the audio card during boot *before the kernel notices it is there*? The only reliable way I've found to check if this is the case, is to run powertop, and look in the 'Device Status' tab for listings of 'Audio codec hwXXXXX: nvidia'. If that is showing up, then the nvidia sound card is still active and will cause hangs. My solution only works if the audio card is removed/disabled before the audio system initialises during boot (hence the WantedBy=sysinit.target in my service file).

I think I should have also mentioned that in order for the kernel to do the PM, you need to do something like :

echo auto | sudo tee /sys/bus/pci/devices/0000:01:00.0/power/control

I have TLP installed, which does this for me.

Now a few days have passed, I admit I have had a few freezes when using bbswitch. But if I disable bbswitch and just use bumblebee with no power management, all is well (so far). If I want to power down the nvidia GFX card I just manually modprobe -r nvidia and the kernel does the rest. Using this solution, I see a drop from about 20W to 10W when the card powers off, with no ACPI calls at all (or, rather, none that I am aware of - I have no idea what the kernel is actually doing behind the scenes).

I am sure that there must be a 'proper' solution where the correct ACPI commands are used to power off/on both the nvidia video and audio at the same time but finding such a solution is far beyond me...

In , mfulz (mfulz-linux-kernel-bugs) wrote :

(In reply to arpie from comment #142)
> (In reply to Matthias Fulz from comment #141)
> > (In reply to arpie from comment #140)
> [snip]
> > > If I completely disable the audio card using :
> > > echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.1/remove
> > >
> > > Then the system hangs are completely cured
> >
> > Not working for me. Still freezing with this.
>
> Any chance of more details? When and how is it freezing? Is it any
> different from before? What are your machine/card details (looks like you
> haven't posted these anywhere above)?
>

I've got an HP OMEN 15 with an NVIDIA GTX 1050, running Arch Linux.

> Also, are you absolutely sure you've disabled the audio card during boot
> *before the kernel notices it is there*? The only reliable way I've found
> to check if this is the case, is to run powertop, and look in the 'Device
> Status' tab for listings of 'Audio codec hwXXXXX: nvidia'. If that is
> showing up, then the nvidia sound card is still active and will cause hangs.
> My solution only works if the audio card is removed/disabled before the
> audio system initialises during boot (hence the WantedBy=sysinit.target in
> my service file).
>

I've used your service file together with bumblebee and bbswitch.

> I think I should have also mentioned that in order for the kernel to do the
> PM, you need to do something like :
>
> echo auto | sudo tee /sys/bus/pci/devices/0000:01:00.0/power/control
>
> I have TLP installed, which does this for me.
>

Ok this step was missing.

> Now a few days have passed, I admit I have had a few freezes when using
> bbswitch. But if I disable bbswitch and just use bumblebee with no power
> management, all is well (so far). If I want to power down the nvidia GFX
> card I just manually modprobe -r nvidia and the kernel does the rest.
> Using this solution, I see a drop from about 20W to 10W when the card powers
> off, with no ACPI calls at all (or, rather, none that I am aware of - I have
> no idea what the kernel is actually doing behind the scenes).
>

Ah I see.
Then I think this is basically similar to my workaround using the snd_hda_intel module parameter.
The nvidia card will just be completely "powered off" by not using it in any way (no module loaded).

> I am sure that there must be a 'proper' solution where the correct ACPI
> commands are used to power off/on both the nvidia video and audio at the
> same time but finding such a solution is far beyond me...

I think some ACPI / PM guys should definitely check the audio part of the GPU, as there could be some issues related to this bug.

I will perhaps try it once again and give feedback here.
But honestly these tests are really harmful for me, because very often some files are randomly truncated to zero during the crash and I then have to restore backups...

In , mfulz (mfulz-linux-kernel-bugs) wrote :

OK, here are more tests:

Disabling the audio part with your suggestion is working.
No NVIDIA audio in powertop or in lspci.

Your solution now basically comes down to loading / unloading the nvidia module, which does work, but I think not the Optimus part?

As soon as I try ACPI, the freeze happens again after running lspci once or twice.

And here the real problem starts:

Just loading and then unloading the nvidia module wakes the card up from the really disabled state. Even with powertop reporting 0% usage for the nvidia card, my power consumption does not go below 11W again.

So the issue here is: your way does not trigger a real shutdown of the nvidia card, as you're just unloading the module.

The difference with bbswitch (which leads to freezes for you as well) is that it does the ACPI calls and really powers down the card, which leads to the freezes...

For some users it might be perfectly fine to just load / unload nvidia, as it does make a difference to the power consumption.

But for me it's around 1/3 of the runtime missing, which really hurts me :)

But perhaps you could try my workaround with two boot entries and check the power consumption on your side when running Intel only?

For me it's around 7-8W Intel only and around 11W when using your workaround.
But again, your solution does not really disable the nvidia card; it's more like just not using it and letting it stay in idle mode with limited PM.
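For anyone who wants to compare these numbers on their own machine, here is a small sketch of how the draw could be read and how the "around 1/3 missing runtime" figure follows from 7-8W versus 11W. It assumes the battery exposes power_now in microwatts under /sys/class/power_supply/BAT0 (the battery name is an assumption, and some batteries only expose current_now and voltage_now instead), and that runtime scales inversely with the draw.

```shell
#!/bin/bash
# Sketch: read battery draw and estimate runtime loss.
# Assumptions: power_now exists and is in microwatts; BAT0 is the battery name.

# Convert a microwatt reading to watts with one decimal place.
microwatts_to_watts() {
    LC_ALL=C awk -v uw="$1" 'BEGIN { printf "%.1f\n", uw / 1000000 }'
}

# Print the current draw in watts from a power_now file.
battery_draw_watts() {
    local file="${1:-/sys/class/power_supply/BAT0/power_now}"
    [ -r "$file" ] || return 1
    microwatts_to_watts "$(cat "$file")"
}

# Percentage of runtime lost when the draw rises from $1 watts to $2 watts,
# assuming runtime is inversely proportional to the draw.
runtime_loss_percent() {
    LC_ALL=C awk -v base="$1" -v actual="$2" \
        'BEGIN { printf "%.0f\n", (1 - base / actual) * 100 }'
}
```

With the figures above, `runtime_loss_percent 7.5 11` gives 32, which matches the roughly one third mentioned.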

In , kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote :

Laptops with Skylake SoC and later shouldn't need bbswitch. PCIe port PM will disable the power of the card.
After nvidia.ko gets unloaded, make sure "power/control" is "auto" for its video (e.g. 01:00.0) and audio (e.g. 01:00.1) functions and its upstream bridge (use lspci -t to check).
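The "power/control" step above could be scripted roughly as follows. This is only a sketch: the function names and the SYSFS_PCI override are made up for illustration, the addresses have to be taken from your own lspci -t output, and writing to the real sysfs files needs root.

```shell
#!/bin/bash
# Sketch: set runtime PM to "auto" for a list of PCI addresses.
# SYSFS_PCI is a hypothetical override so the functions can be tried on a
# scratch directory; by default the real sysfs tree is used.

# Write "auto" to power/control of every address given; returns non-zero if
# any of the files could not be written.
set_runtime_pm_auto() {
    local root="${SYSFS_PCI:-/sys/bus/pci/devices}"
    local addr ctl rc=0
    for addr in "$@"; do
        ctl="$root/$addr/power/control"
        if [ -w "$ctl" ]; then
            echo auto > "$ctl"
        else
            echo "cannot write $ctl" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Print "address value" for every address given ("unknown" if unreadable).
show_runtime_pm() {
    local root="${SYSFS_PCI:-/sys/bus/pci/devices}"
    local addr
    for addr in "$@"; do
        echo "$addr $(cat "$root/$addr/power/control" 2>/dev/null || echo unknown)"
    done
}
```

As root, something like `set_runtime_pm_auto 0000:01:00.0 0000:01:00.1 0000:00:01.0` would cover the video function, the audio function and the bridge, where 0000:00:01.0 is only a placeholder for whatever upstream bridge lspci -t shows.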

In addition to that, these two commits are also required for mainline kernel users:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=52525b7a3cf82adec5c6cf0ecbd23ff228badc94
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=bacd861452d2be86a4df341b12e32db7dac8021e

In , russell.kernel (russell.kernel-linux-kernel-bugs) wrote :

(In reply to Matthias Fulz from comment #143)
> (In reply to arpie from comment #142)
> > (In reply to Matthias Fulz from comment #141)
> > > (In reply to arpie from comment #140)
> > [snip]
>
> Ah I see.
> Then I think this is basically somehow similar to my workaround using the
> snd_hda_intel module parameter.
> The nvidia card will just be completely "powered off" by not using it in any
> way (no module loaded)

Yes, now that I've read your workaround more closely, I think you're right: it is basically achieving the same thing.

> > I am sure that there must be a 'proper' solution where the correct ACPI
> > commands are used to power off/on both the nvidia video and audio at the
> > same time but finding such a solution is far beyond me...
>
> I think some ACPI / PM guys should definitely check the audio part of the GPU
> as there could be some issues related to this bug.

Judging by comment 145, they are already way ahead of us!

In , russell.kernel (russell.kernel-linux-kernel-bugs) wrote :

(In reply to Matthias Fulz from comment #144)
> Ok here are more tests:
>
> Disabling the audio part with your suggestion is working.
> No NVIDIA audio in powertop nor in lspci.
>
> Your solution is basically loading / unloading the nvidia module now, which
> indeed is working, but not the optimus part I think?

Optimus IS working here, no problem at all.

[snip]
> For some users it might be fully ok to just use load / unload nvidia as it
> makes a difference for the power consumption.
>
> But for me it's around 1/3 missing runtime, which really hurts me :)
>
> But perhaps you could try my workaround with two boot entries and check the
> power consumption on your side, when running intel only?

I will try this maybe tonight when I will have more time to spare. I too would be interested in gaining 20-30% battery life! But, then again, I wouldn't want to have to reboot to be able to use the dGPU (I use it for blender3d).

> for me it's around 7-8W intel only and around 11W when using your workaround.
> But again your solution is just not really disabling the nvidia card,
> instead it's more like just not using it and let it stay in idle mode with
> limited PM.

Yes, and no... as far as I can see from my tests, it is not staying in idle mode, it is being fully powered down by the kernel when not in use, and fully powered up again when I use optirun.

In , russell.kernel (russell.kernel-linux-kernel-bugs) wrote :

(In reply to Kai-Heng Feng from comment #145)

Thank you very much for this optimistic-sounding info, Kai-Heng Feng.

> Laptops with Skylake SoC and later shouldn't need bbswitch. PCIe port PM
> will disable the power of the card.

How do I check if I have Skylake SoC?

I am actually currently using a modified bbswitch where I have disabled the acpi calls. The point of this is to force bumblebee to automatically load and unload the nvidia modules before and after using optirun. I suspect there is an easier way but for now this works for me.

> After nvidia.ko gets unloaded, make sure "power/control" is "auto" for its
> video (e.g. 01:00.0) and audio (e.g. 01:00.1) functions and its upstream
> bridge (use lspci -t to check).

I initially tried what you describe here but the audio part was preventing power management from happening because it was permanently flagged in use by the snd_hda_intel module. Hence my work-around.

> In addition to that, these two commits are also required for mainline kernel
> users:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/
> ?id=52525b7a3cf82adec5c6cf0ecbd23ff228badc94
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/
> ?id=bacd861452d2be86a4df341b12e32db7dac8021e

I am not interested in attempting to compile the kernel so will wait for these two commits to make it into the stable release. Reading their descriptions, especially the second one, sounds like it is the perfect fix for what I am experiencing.

In , mfulz (mfulz-linux-kernel-bugs) wrote :

(In reply to Kai-Heng Feng from comment #145)
> Laptops with Skylake SoC and later shouldn't need bbswitch. PCIe port PM
> will disable the power of the card.
> After nvidia.ko gets unloaded, make sure "power/control" is "auto" for its
> video (e.g. 01:00.0) and audio (e.g. 01:00.1) functions and its upstream
> bridge (use lspci -t to check).
>
> In addition to that, these two commits are also required for mainline kernel
> users:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/
> ?id=52525b7a3cf82adec5c6cf0ecbd23ff228badc94
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/
> ?id=bacd861452d2be86a4df341b12e32db7dac8021e

OK, I have a Coffee Lake and will try these two commits and report back.

odror (ozdror) wrote :

I have an HP Spectre x360
(i7-9750H, NVIDIA GTX 1650).

I have the same issue. This is the first laptop I have owned that is not suitable for Linux because of this issue. Any idea when this will be fixed in any kernel?

In , adikurthy (adikurthy-linux-kernel-bugs) wrote :

(In reply to Kai-Heng Feng from comment #145)
> Laptops with Skylake SoC and later shouldn't need bbswitch. PCIe port PM
> will disable the power of the card.
> After nvidia.ko gets unloaded, make sure "power/control" is "auto" for its
> video (e.g. 01:00.0) and audio (e.g. 01:00.1) functions and its upstream
> bridge (use lspci -t to check).
>
> In addition to that, these two commits are also required for mainline kernel
> users:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/
> ?id=52525b7a3cf82adec5c6cf0ecbd23ff228badc94
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/
> ?id=bacd861452d2be86a4df341b12e32db7dac8021e

I have an i7 8750H with a GTX 1050 Mobile. I applied these two patches on top of Linus' tree. After I switched all "power/control" to "auto", everything works now. Card powers down, suspend/resume works.

Thank you for figuring this out. Before this I was getting lockups with bbswitch/acpi_call during boot. I had to do crazy workarounds to get away with this during early boot and suspend/resume. Those days are gone now!

odror (ozdror) wrote :

Which mainline kernel are the patches applied to? Is it coming soon, or do I have to apply the patches myself?

In , mfulz (mfulz-linux-kernel-bugs) wrote :

I've tried the patches and can confirm that the lockups are gone when just using bumblebee (unloading the nvidia module).

But the problems are still the same:
1.) Just unloading the nvidia modules keeps the power consumption at 13-14W, which is 5-6W more (almost double) compared to Intel only at ~8W.

2.) Using an ACPI call to power off the nvidia card completely drops the power consumption to 8W, but the lockups are back.

So for me the patches are not working :(

In , mfulz (mfulz-linux-kernel-bugs) wrote :

Of course I've set the power to auto for everything

In , mfulz (mfulz-linux-kernel-bugs) wrote :

More info:

powertop shows this when using intel gpu only (my workarounds from post https://bugzilla.kernel.org/show_bug.cgi?id=156341#c139)

              0.0% PCI Device: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16)
              0.0% PCI Device: Intel Corporation Cannon Lake PCH HECI Controller
              0.0% PCI Device: NVIDIA Corporation GP107M [GeForce GTX 1050 Mobile]
              0.0% PCI Device: Intel Corporation Cannon Lake PCH SPI Controller
              0.0% PCI Device: Intel Corporation Cannon Lake PCH cAVS
              0.0% PCI Device: NVIDIA Corporation GP107GL High Definition Audio Controller
              0.0% PCI Device: Intel Corporation Cannon Lake PCH Shared SRAM
              0.0% PCI Device: Intel Corporation Cannon Lake PCH SMBus Controller
              0.0% PCI Device: Intel Corporation Cannon Lake PCH PCI Express Root Port #14

When using the patches with the nvidia modules unloaded and power set to auto, it's still saying 100%.

In , mfulz (mfulz-linux-kernel-bugs) wrote :

OK, after I removed the patches and the normal kernel update to 5.3.11 happened, I'm experiencing the same higher power consumption that I saw during the earlier test.

It could be related to something other than the patches.

But I'm unable to find out at the moment where it comes from :(
The PC is not going below 10W with 5.3.11;
on 5.3.8 it drops to 7-8W.

In , mfulz (mfulz-linux-kernel-bugs) wrote :

OK, I've found out that this difference seems to go away when using tlp instead of laptop-mode-tools.

Kernel 5.5-rc3 (including these patches) + Bumblebee and tlp provides me the full optimus usage + 8/9W power consumption on intel GPU.

For me everything is working absolutely perfect now !!! :)

In , daniel.gomme (daniel.gomme-linux-kernel-bugs) wrote :

Are these patches included in kernel 5.5.2? I blacklist nvidia modules on boot, and "echo 'OFF' | sudo tee /proc/acpi/bbswitch" (which succeeds) followed by lspci also produces a freeze on my machine. Without the call to bbswitch, the freeze does not happen, but my idle power usage is at 25-30W, instead of the ~11 that occurs if I do invoke bbswitch.

I'm running Arch, with the 5.5.2 kernel. Hardware is an Intel i7-8750H and NVidia GTX 1070M, and my kernel parameters are "root=/dev/nvme0n1p5 rw add_efi_memmap initrd=intel-ucode.img initrd=initramfs-%v.img nopti intel_iommu=on iommu=on sysrq_always_enabled=1".

In , mfulz (mfulz-linux-kernel-bugs) wrote :

Yep, the current Arch kernel is working fine here, including these patches:
Linux omega 5.5.2-arch1-1 #1 SMP PREEMPT Tue, 04 Feb 2020 18:56:18 +0000 x86_64 GNU/Linux

Best power consumption so far: 6-7W during normal work (wlan, 25% display, browsing, etc.)

No lockups anymore using bumblebee (primusrun, optirun); I just need to manually unload the nvidia modules afterwards, which I've included in simple scripts.

Thanks to all for this :)

In , daniel.gomme (daniel.gomme-linux-kernel-bugs) wrote :

(In reply to Matthias Fulz from comment #157)
> Yep the current arch kernel is working fine here, including these patches:
> Linux omega 5.5.2-arch1-1 #1 SMP PREEMPT Tue, 04 Feb 2020 18:56:18 +0000
> x86_64 GNU/Linux
>
> Best power consumption so far 6-7W normal working (wlan, 25% display,
> browsing, etc.)
>
> No lockups anymore, using bumblebee (primusrun, optirun) just need to
> manually unload the nvidia modules afterwards, which I've included in simple
> scripts.
>
> Thanks to all for this :)

I'm currently using the vanilla kernel and still run into the lockups I mentioned above. Do I need to apply patches myself? Or is it just the stuff I'm using (eg I only use bbswitch, on top of optimus-manager)?

Sorry if this is a little asking the obvious, just had this problem for a while now and not sure how to deal with it.

In , mfulz (mfulz-linux-kernel-bugs) wrote :

Yep, the problem is bbswitch. Just uninstall it, make sure the power mode for the nvidia GPU and HDMI sound is set to auto, and unload the nvidia & nvidia_modeset modules.

bbswitch will still lead to the lockups, if used.

In , daniel.gomme (daniel.gomme-linux-kernel-bugs) wrote :

(In reply to Matthias Fulz from comment #159)
> Yep the problem is bbswitch. Just uninstall it, make sure the power mode for
> nvidia gpu and hdmi sound are set to auto and unload the nvidia &
> nvidia_modeset modules.
>
> bbswitch will still lead to the lockups, if used.

Awesome! Blacklisting the modules, then setting runtime power management to auto for everything in powertop put the power usage down to about 8W. Loading the modules back up again manually puts the power usage back up, and unloading them back down, without having to use bbswitch at all.

Do you know if it's possible, with an X session that started up without the nvidia modules loaded, to then load those modules and use that GPU with '__NV_PRIME_RENDER_OFFLOAD=1 __VK_LAYER_NV_optimus=NVIDIA_only __GLX_VENDOR_LIBRARY_NAME=nvidia "$@"' (the PRIME render offload in the recent drivers)? I've so far not had any success :(
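For reference, the command line I mean could be wrapped like this. It is only a sketch of the wrapper itself (the name prime_run is my own choice), and it says nothing about whether loading the modules after X has started can work.

```shell
#!/bin/bash
# Sketch: run one command with the PRIME render offload variables set.
# The three variables are the ones quoted above; prime_run is a made-up name.

prime_run() {
    # Refuse to run without a command, otherwise the assignments below
    # would be applied to the current shell instead of a child process.
    [ "$#" -gt 0 ] || return 2
    __NV_PRIME_RENDER_OFFLOAD=1 \
    __VK_LAYER_NV_optimus=NVIDIA_only \
    __GLX_VENDOR_LIBRARY_NAME=nvidia \
    "$@"
}
```

Something like `prime_run glxinfo` would then run only that one program with the offload variables, leaving the rest of the session untouched.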

In , mfulz (mfulz-linux-kernel-bugs) wrote :

I'm just using bumblebee and do not blacklist nvidia modules.

This is my /etc/bumblebee/xorg.conf.nvidia

Section "ServerLayout"
    Identifier "Layout0"
    Option "AutoAddDevices" "true"
    Option "AutoAddGPU" "false"
EndSection

Section "Device"
    Identifier "DiscreteNvidia"
    Driver "nvidia"
    VendorName "NVIDIA Corporation"

    Option "NoLogo" "true"
    Option "UseEDID" "false"
    Option "AllowEmptyInitialConfiguration"
EndSection

Section "Screen"
    Identifier "Screen0"
    Device "DiscreteNvidia"
EndSection

And I'm just using an additional systemd service for powertop /etc/systemd/system/powertop.service:

[Unit]
Description=PowerTOP auto tune

[Service]
Type=idle
Environment="TERM=dumb"
ExecStart=/usr/bin/bash -c "sleep 30 && /usr/bin/powertop --auto-tune && sleep 10 && echo 'on' > '/sys/bus/usb/devices/1-1/power/control'"

[Install]
WantedBy=multi-user.target

Then I've got everything ready and can use:
primusrun
optirun

for nvidia GPU stuff. To unload the modules after use I'm using the following scripts:

/usr/local/bin/primusrun

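#!/bin/bash

# Also unload the modules when interrupted (HUP, INT, QUIT, ABRT).
trap unload HUP INT QUIT ABRT

unload() {
    # Only act if an nvidia module is currently loaded.
    if /usr/bin/lsmod | grep -q nvidia
    then
        echo "unloading nvidia modules ..."
        sleep 2
        # nvidia_modeset depends on nvidia, so it has to go first.
        if /usr/bin/lsmod | grep -q nvidia_modeset
        then
            sudo /usr/bin/rmmod nvidia_modeset
        fi
        sudo /usr/bin/rmmod nvidia
        echo "finished."
    fi
}

# Calls the real primusrun; this assumes the wrapper does not shadow it in PATH.
primusrun "$@"
unload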

/usr/local/bin/optirun

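#!/bin/bash

# Also unload the modules when interrupted (HUP, INT, QUIT, ABRT).
trap unload HUP INT QUIT ABRT

unload() {
    # Only act if an nvidia module is currently loaded.
    if /usr/bin/lsmod | grep -q nvidia
    then
        echo "unloading nvidia modules ..."
        sleep 2
        # nvidia_modeset depends on nvidia, so it has to go first.
        if /usr/bin/lsmod | grep -q nvidia_modeset
        then
            sudo /usr/bin/rmmod nvidia_modeset
        fi
        sudo /usr/bin/rmmod nvidia
        echo "finished."
    fi
}

# Calls the real optirun; this assumes the wrapper does not shadow it in PATH.
optirun "$@"
unload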

That's it for me.

In addition, for power savings I'm using TLP with mostly default settings.

Offloading inside the nvidia drivers is just not really helpful afaik :)

Hope that helps.

In , mfulz (mfulz-linux-kernel-bugs) wrote :

Additional information:

The 'on' is for my Logitech receiver, as it is annoying as hell when power saving is enabled for it: I need to move the mouse for 2s before it's working again.

And the sleep is needed so that all the hardware is available before powertop changes the modes.

In , daniel.gomme (daniel.gomme-linux-kernel-bugs) wrote :

(In reply to Matthias Fulz from comment #161)
[snip]

Cheers! I'll see what I can do there :) Thank you so much for all the help

In , ranjithshegde (ranjithshegde-linux-kernel-bugs) wrote :

(In reply to Matthias Fulz from comment #161)
[snip]

Hello,
Thank you for your patch and effort. I tried your primusrun patch. First I get infinite repetitions of this:
/bin/bash: warning: shell level (1000) too high, resetting to 1

and when I stop (C-c) I get an infinite loop of this:

rmmod: ERROR: Module nvidia is not currently loaded
finished.
unloading nvidia modules ...
rmmod: ERROR: Module nvidia is not currently loaded
finished.

Nothing launches.
Any ideas?

I am on Arch, on an Optimus laptop with Intel Coffee Lake and an RTX 2070.
I have bumblebeed enabled with your recommended xorg.conf.nvidia settings, using powertop PM to turn off the Nvidia card, which works fine.

In , mfulz (mfulz-linux-kernel-bugs) wrote :

(In reply to Ranjith Hegde from comment #164)
> (In reply to Matthias Fulz from comment #161)
>
> Hello,
> Thank you for your patch and effort. I tried your primusrun patch. First I
> get an infinite repetitions of this
> /bin/bash: warning: shell level (1000) too high, resetting to 1
>
OK, this is NOT a patch :)
It's just a simple script that wraps optirun / primusrun so that the nvidia module gets loaded and then unloaded again.

> and when I stop (C-c) I get an infinite loop of this
>
> rmmod: ERROR: Module nvidia is not currently loaded
> finished.
> unloading nvidia modules ...
> rmmod: ERROR: Module nvidia is not currently loaded
> finished.
>
> nothing launches..
> Any ideas?
>
Yes: I'm quite sure you've got /usr/local/bin in your PATH, and before the /usr/bin entry where the real optirun / primusrun live, so the script ends up calling itself.

Two possible solutions:

1.) Change the lines that call primusrun and optirun to use the full path, e.g. /usr/bin/primusrun $@ instead of primusrun $@

2.) Rename the scripts to something like /usr/local/bin/primusrun.sh and /usr/local/bin/optirun.sh

The second solution will avoid any naming clashes for sure.
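A third variant, if you want to keep the wrapper names, would be to let the wrapper look up the real binary while skipping its own directory. This is just a sketch and not what I'm running myself; find_real_command is a made-up helper, and /usr/local/bin as the wrapper directory is an assumption based on the paths above.

```shell
#!/bin/bash
# Sketch: find the first executable named $1 on PATH that is not located
# in directory $2 (the directory holding the wrapper script itself).
# find_real_command is a hypothetical helper, not part of bumblebee.

find_real_command() {
    local name="$1" skip_dir="$2" dir
    local IFS=:
    for dir in $PATH; do
        [ -n "$dir" ] || continue
        [ "$dir" = "$skip_dir" ] && continue
        if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}
```

Inside the wrapper, the call line could then become something like `"$(find_real_command primusrun /usr/local/bin)" "$@"`, which avoids the endless self-invocation without hardcoding /usr/bin.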

Hope that helps

BR,
Matthias

In , ranjithshegde (ranjithshegde-linux-kernel-bugs) wrote :

(In reply to Matthias Fulz from comment #165)
[snip]

Thank you for your reply!

Firstly, about the naming hiccup: I come from a computer music background, where any piece of code is referred to as a patch. I know it's quite different in the shell scripting or general programming world. My bad.

I tried your second solution of using .sh and running it after I close all programs running on Nvidia. It works. Thanks!

If I were to place your script (!) directly inside /usr/bin/primusrun, where would you suggest I put it? Here's how it looks, unedited:

#!/bin/bash

# Readback-display synchronization method
# 0: no sync, 1: D lags behind one frame, 2: fully synced
# export PRIMUS_SYNC=${PRIMUS_SYNC:-0}

# Verbosity level
# 0: only errors, 1: warnings (default), 2: profiling
# export PRIMUS_VERBOSE=${PRIMUS_VERBOSE:-1}

# Upload/display method
# 0: autodetect, 1: textures, 2: PBO/glDrawPixels (needs Mesa-10.1+)
# export PRIMUS_UPLOAD=${PRIMUS_UPLOAD:-0}

# Approximate sleep ratio in the readback thread, percent
# export PRIMUS_SLEEP=${PRIMUS_SLEEP:-90}

# Secondary display
# export PRIMUS_DISPLAY=${PRIMUS_DISPLAY:-:8}

# "Accelerating" libGL
# $LIB will be interpreted by the dynamic linker
# export PRIMUS_libGLa=${PRIMUS_libGLa:-'/usr/$LIB/nvidia/libGL.so.1'}

# "Displaying" libGL
# export PRIMUS_libGLd=${PRIMUS_libGLd:-'/usr/$LIB/libGL.so.1'}

# Directory containing primus libGL
PRIMUS_libGL=/usr/\$LIB/primus

# On some distributions, e.g. on Ubuntu, libnvidia-tls.so is not available
# in default search paths. Add its path manually after the primus library
# PRIMUS_libGL=${PRIMUS_libGL}:/usr/lib/nvidia-current:/usr/lib32/nvidia-current

# Mesa drivers need a few symbols to be visible
# export PRIMUS_LOAD_GLOBAL=${PRIMUS_LOAD_GLOBAL:-'libglapi.so.0'}

# Need functions from primus libGL to take precedence
export LD_LIBRARY_PATH=${PRIMUS_libGL}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

# And go!
export __GLVND_DISALLOW_PATCHING=1
exec "$@"
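One thing worth noting about the script above: its last line uses exec, which replaces the shell with the program, so anything placed after that line would never run. A possible way around this, as far as I understand it, is to run the program as a normal child and call the cleanup afterwards. The following is only a sketch; run_then_cleanup is a made-up helper, the cleanup would be the unload function from comment #161, and any edit to /usr/bin/primusrun would presumably be overwritten by the next package upgrade.

```shell
#!/bin/bash
# Sketch: run a command, then always run a cleanup function, and return the
# exit status of the command. run_then_cleanup is a hypothetical helper.

run_then_cleanup() {
    local cleanup="$1" status
    shift
    # Plain call instead of exec, so the shell survives to run the cleanup.
    "$@"
    status=$?
    "$cleanup"
    return "$status"
}
```

In the packaged script, the final `exec "$@"` would then become something like `run_then_cleanup unload "$@"`, with the unload function defined further up in the same file.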

In , karolherbst (karolherbst-linux-kernel-bugs) wrote :

btw, this bug should be fixed with the nouveau driver in 5.7 (and backports should follow soon for 5.4, 5.5 and 5.6)

In , xxxaaa (xxxaaa-linux-kernel-bugs) wrote :

(In reply to Karol Herbst from comment #167)
> btw, this bug should be fixed with the nouveau driver in 5.7 (and backports
> should follow soon for 5.4, 5.5 and 5.6)

I have a similar issue, https://bugzilla.kernel.org/show_bug.cgi?id=206727, but I use the NVIDIA PRIME driver. Will your bugfix work for that too, or does it work only for the nouveau driver?

In , karolherbst (karolherbst-linux-kernel-bugs) wrote :

(In reply to xxxaaa from comment #168)
> (In reply to Karol Herbst from comment #167)
> > btw, this bug should be fixed with the nouveau driver in 5.7 (and backports
> > should follow soon for 5.4, 5.5 and 5.6)
>
> I have similar issue https://bugzilla.kernel.org/show_bug.cgi?id=206727 but
> I use NVIDIA PRIME driver. Will your bugfix work for that too or does it
> work only for nouveau drivers?

The fix is for Nouveau only, but I am actually in contact with Nvidia about the same issue they have inside their driver as well.

But the bug can be fixed from within the open source files inside the nvidia driver as well. I just hope that Nvidia will have a proper solution here, as the fix in Nouveau is also more of a workaround than a real fix.

In , kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote :

Karol Herbst,

Is it fixed by removing the OSI vendor strings?

In , karolherbst (karolherbst-linux-kernel-bugs) wrote :

(In reply to Kai-Heng Feng from comment #170)
> Karol Herbst,
>
> Is it fixed by removing the OSI vendor strings?

should be, yes.

In , ranjithshegde (ranjithshegde-linux-kernel-bugs) wrote :

New problems in an old thread

After using @Matthias Fulz's method (bumblebee without bbswitch, plus power management with powertop and tlp), it's quite simple to run any software requiring nvidia with his patch.

The problem is preventing other software from using nvidia when it is not run with optirun/primusrun. For example, any software that calls for EGL automatically turns on nvidia; mpv is one of those.

Is there any way to ensure EGL calls intel instead? There is a solution in the Arch wiki that sets the environment variable __EGL_VENDOR_LIBRARY_FILENAMES="/usr/share/glvnd/egl_vendor.d/50_mesa.json",

but it does not work. I have tried adding it in all possible places (/etc/environment, /etc/profile, bash-profile, the zsh profile equivalent, etc.).
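
For reference, the variable can also be scoped to a single command instead of the whole session. The following is only a sketch under stated assumptions: the helper name `run_with_mesa_egl` and the `MESA_EGL_VENDOR_JSON` override are made up for illustration, the JSON path is the one quoted above, and nothing in this thread confirms that this keeps the NVIDIA card asleep.

```shell
#!/usr/bin/env bash
# Path quoted from the Arch wiki above; MESA_EGL_VENDOR_JSON is a
# hypothetical override so the helper can be pointed elsewhere.
MESA_EGL_VENDOR_JSON="${MESA_EGL_VENDOR_JSON:-/usr/share/glvnd/egl_vendor.d/50_mesa.json}"

# Hypothetical helper: run one command with glvnd's EGL vendor list
# restricted to the Mesa entry. Only that command's environment changes.
run_with_mesa_egl() {
    if [ ! -r "$MESA_EGL_VENDOR_JSON" ]; then
        echo "EGL vendor file not readable: $MESA_EGL_VENDOR_JSON" >&2
        return 1
    fi
    env __EGL_VENDOR_LIBRARY_FILENAMES="$MESA_EGL_VENDOR_JSON" "$@"
}
```

Usage would look like `run_with_mesa_egl mpv video.mkv`. Whether the variable actually reached a running process can be checked with `tr '\0' '\n' < /proc/PID/environ | grep EGL`.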

In , xxxaaa (xxxaaa-linux-kernel-bugs) wrote :

(In reply to Karol Herbst from comment #169)
> (In reply to xxxaaa from comment #168)
> > (In reply to Karol Herbst from comment #167)
> > > btw, this bug should be fixed with the nouveau driver in 5.7 (and
> > > backports should follow soon for 5.4, 5.5 and 5.6)
> >
> > I have similar issue https://bugzilla.kernel.org/show_bug.cgi?id=206727 but
> > I use NVIDIA PRIME driver. Will your bugfix work for that too or does it
> > work only for nouveau drivers?
>
> the fix is for Nouveau only, but I am actually in contact with Nvidia about
> the same issue they have inside their driver as well.
>
> But the bug can be fixed from within the open source files inside the nvidia
> driver as well I just hope that Nvidia will have a proper solution here as
> the fix in Nouveau is also more of a workaround than a real fix.

(Sorry for being impatient but this bug has been bugging me a lot.)
Do you know if this has been fixed on Nvidia Drivers yet or not? And if it has not been fixed then whether it is on their roadmap or not?

In , karolherbst (karolherbst-linux-kernel-bugs) wrote :

(In reply to xxxaaa from comment #173)
> (In reply to Karol Herbst from comment #169)
> > (In reply to xxxaaa from comment #168)
> > > (In reply to Karol Herbst from comment #167)
> > > > btw, this bug should be fixed with the nouveau driver in 5.7 (and
> > > > backports should follow soon for 5.4, 5.5 and 5.6)
> > >
> > > I have similar issue https://bugzilla.kernel.org/show_bug.cgi?id=206727
> > > but I use NVIDIA PRIME driver. Will your bugfix work for that too or does
> > > it work only for nouveau drivers?
> >
> > the fix is for Nouveau only, but I am actually in contact with Nvidia about
> > the same issue they have inside their driver as well.
> >
> > But the bug can be fixed from within the open source files inside the
> > nvidia driver as well I just hope that Nvidia will have a proper solution
> > here as the fix in Nouveau is also more of a workaround than a real fix.
>
> (Sorry for being impatient but this bug has been bugging me a lot.)
> Do you know if this has been fixed on Nvidia Drivers yet or not? And if it
> has not been fixed then whether it is on their roadmap or not?

AFAIK they have not, and my attempts to get them to even reproduce the issue failed...

So I don't see it getting fixed any time soon and, honestly, if I have to put more time into it I might even stop caring, as the fix we have for Nouveau is good enough and appears to have resolved the issue for good.

In case you are still plagued by it, you might want to bring that up with Nvidia directly and see whether more reports motivate them to fix it.

In , xxxaaa (xxxaaa-linux-kernel-bugs) wrote :

(In reply to Karol Herbst from comment #174)
> (In reply to xxxaaa from comment #173)
> > (In reply to Karol Herbst from comment #169)
> > > (In reply to xxxaaa from comment #168)
> > > > (In reply to Karol Herbst from comment #167)
> > > > > btw, this bug should be fixed with the nouveau driver in 5.7 (and
> > > > > backports should follow soon for 5.4, 5.5 and 5.6)
> > > >
> > > > I have similar issue https://bugzilla.kernel.org/show_bug.cgi?id=206727
> > > > but I use NVIDIA PRIME driver. Will your bugfix work for that too or
> > > > does it work only for nouveau drivers?
> > >
> > > the fix is for Nouveau only, but I am actually in contact with Nvidia
> > > about the same issue they have inside their driver as well.
> > >
> > > But the bug can be fixed from within the open source files inside the
> > > nvidia driver as well I just hope that Nvidia will have a proper solution
> > > here as the fix in Nouveau is also more of a workaround than a real fix.
> >
> > (Sorry for being impatient but this bug has been bugging me a lot.)
> > Do you know if this has been fixed on Nvidia Drivers yet or not? And if it
> > has not been fixed then whether it is on their roadmap or not?
>
> Afaik they do not and my attempts to get them to even reproduce the issue
> failed...
>
> so I don't see it getting fixed any time soon and honestly, if I have to put
> more time into it I might even stop caring as the fix we have for Nouveau is
> good enough and appears to have resolved the issue for good.
>
> In case you are still plagued by it I suspect you might want to bring that
> up to Nvidia directly and see if they get more motivated fixing it if they
> get more reports.

I had asked about it in the nvidia linux forums some time back but nothing came out of it either :(
https://forums.developer.nvidia.com/t/bug-cant-change-power-state-from-d3cold-to-d0-config-space-inaccessible-stuck-at-boot/112912

Some time back you had said -

> But the bug can be fixed from within the open source files inside the nvidia
> driver

So, can your nouveau patch be applied to nvidia driver as well?

In , nheart (nheart-linux-kernel-bugs) wrote :

(In reply to Ranjith Hegde from comment #172)
> New problems in an old thread
>
> After using @Matthias Fulz's method (bumblebee without bbswitch + pm with
> powertop+tlp) its quite simple to run any software requiring nvidia. with
> his patch.
>
> The problem is in preventing any other software from using nvidia when not
> run with optirun/primusrun. For example, any software that calls for EGL
> automatically turns on nvidia. Mpv is one of those.
>
> Is there any way to ensure EGL calls intel instead? There is solution in
> arch wiki with env-variable
> "__EGL_VENDOR_LIBRARY_FILENAMES="/usr/share/glvnd/egl_vendor.d/50_mesa.json"
> "
>
> but it does not work. I have tried adding it in all possible places
> (/etc/environment, /etc/profile, bash-profile, zsh-profile-equivalent etc..)

Hey,

I was also struggling with those problems. Instead of removing just the nvidia audio device from the PCIe bus, I remove the nvidia card itself:

% cat /usr/local/bin/remove_nvidia
#!/usr/bin/bash
# Needs root. Detach both PCI functions of the nvidia card from the bus.
echo 1 > /sys/bus/pci/devices/0000:01:00.1/remove
echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove

I do this at boot via systemd. There is a post earlier that describes how to do it. You should also make sure that the nvidia modules are not loaded (this should be taken care of if you use bumblebee).

In order to re-enable the nvidia card when you need it, use:

% cat /usr/local/bin/add_nvidia
#!/usr/bin/env bash
# Needs root. Rescan the PCI bus so the removed card is discovered again.
echo 1 > /sys/bus/pci/rescan

You can modify your bumblebee/optimus/primus scripts accordingly.
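
As a rough illustration of such a modification, the two scripts could be combined into one wrapper that rescans the bus, runs a command, and removes the card again afterwards. This is a sketch, not a tested replacement for the bumblebee scripts: the function names and the `SYSFS_PCI` override (useful for a dry run against a scratch directory) are invented here, and the PCI addresses are the ones from the listings above and may differ on other machines.

```shell
#!/usr/bin/env bash
# SYSFS_PCI is a hypothetical override; the default is the real sysfs path.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"
# Same addresses and order as in remove_nvidia above; adjust for your system.
NVIDIA_FUNCS="0000:01:00.1 0000:01:00.0"

# Rescan the bus so the removed card shows up again (needs root).
add_nvidia() {
    echo 1 > "$SYSFS_PCI/rescan"
}

# Remove every listed function that is currently present (needs root).
remove_nvidia() {
    for dev in $NVIDIA_FUNCS; do
        if [ -e "$SYSFS_PCI/devices/$dev/remove" ]; then
            echo 1 > "$SYSFS_PCI/devices/$dev/remove"
        fi
    done
}

# Bring the card back, run the given command, then remove the card again.
# Returns the exit status of the command.
with_nvidia() {
    add_nvidia || return 1
    cmd_status=0
    "$@" || cmd_status=$?
    remove_nvidia
    return "$cmd_status"
}
```

A call such as `with_nvidia optirun glxgears` would then wrap a single program. The nvidia kernel modules still have to be unloaded before the card is removed, which this sketch leaves to bumblebee as described above.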

In , karolherbst (karolherbst-linux-kernel-bugs) wrote :

Since this bug is fixed in nouveau and the last responses here are all related to the nvidia driver, can we close this bug?

If people are having issues with the nvidia driver, they should contact Nvidia about it. Anyway, I am removing myself from CC due to the noise.
