ucsi_ccg 50 second hang while resuming from s2ram with nvidia, recent kernels

Bug #1850238 reported by Nickolay Ponomarev
This bug affects 24 people
Affects                    Status        Importance  Assigned to  Milestone
linux (Arch Linux)         New           Undecided   Unassigned
linux (Ubuntu)             Confirmed     Undecided   Unassigned
  Bionic                   Won't Fix     Undecided   Unassigned
  Eoan                     Fix Released  Undecided   Unassigned
  Focal                    Confirmed     Undecided   Unassigned
linux-oem-osp1 (Ubuntu)    Confirmed     Undecided   Unassigned
  Bionic                   Fix Released  Undecided   Unassigned
  Eoan                     Fix Released  Undecided   Unassigned
  Focal                    Won't Fix     Undecided   Unassigned

Bug Description

=== SRU Justification ===
[Impact]
Some systems have a "phantom" Nvidia UCSI device, which prevents them from
suspending.

[Fix]
ucsi_ccg is stuck in its probe routine because the i2c bus never times
out. Let it time out so that the probe can fail, since it's just a phantom
device.

[Test]
After applying this patch, the system can suspend/resume successfully.

[Regression Potential]
Low. It's a trivial change to correctly handle timeout.

=== Original Bug Report ===

Short version
=============
I'm experiencing a 50-second hang each time I resume from a "deep" (suspend-to-RAM) sleep.

It happens with the newer kernel (5.3 series; I'm currently running the version from eoan-proposed), but not with the kernel from Ubuntu 18.04.3 LTS (uname says "5.0.0-31-generic #33~18.04.1-Ubuntu SMP").

[I haven't yet tried to test the mainline builds, nor to find/confirm the regression range, as this seems like something that will take me another week, and I'm not sure if it would be helpful.]

I narrowed the problem down to what I believe is a broken USB Type-C controller on the NVIDIA GPU: the ucsi_ccg driver for /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.3/i2c-0/0-0008 reports a timeout for both the initial PPM_RESET command (on system startup) and for the SET_NOTIFICATION_ENABLE command the driver runs on resume.

I guess the hang is the driver waiting for a response to SET_NOTIFICATION_ENABLE; it appears to have been added recently in https://github.com/torvalds/linux/commit/a94ecde41f7e51e2742e53b5f151aee662c54d39, which could explain why I don't see the hang with 5.0.x.

Creating /etc/modprobe.d/dell.conf with a `blacklist ucsi_ccg` line (and rebooting) makes the hang go away.
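
To double-check that the workaround is actually in place, the blacklist entry can be looked up programmatically. Below is a minimal Python sketch, assuming modprobe.d-style text in which `#` starts a comment. The helper name `is_blacklisted` is hypothetical, and treating `-` and `_` as interchangeable in module names is an assumption about how modprobe matches names:

```python
def is_blacklisted(conf_text: str, module: str) -> bool:
    """Return True if conf_text contains a `blacklist <module>` line."""
    wanted = module.replace("-", "_")  # assumption: '-' and '_' are equivalent
    for raw in conf_text.splitlines():
        # Drop anything after '#', then split the rest into words.
        words = raw.split("#", 1)[0].split()
        if len(words) == 2 and words[0] == "blacklist":
            if words[1].replace("-", "_") == wanted:
                return True
    return False


# Sample content mirroring the workaround described above.
sample = "# workaround for LP: #1850238\nblacklist ucsi_ccg\n"
print(is_blacklisted(sample, "ucsi_ccg"))
```

On a real system the text would come from reading /etc/modprobe.d/dell.conf (or whichever file holds the entry) instead of the inline `sample` string.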

Steps to reproduce
==================
(these are not the actual steps one can take to reproduce, starting from a new install; let me know if those will be useful)

1. Boot Ubuntu 19.10 with NVIDIA GPU drivers uninstalled and the following kernel parameters <https://askubuntu.com/questions/19486/how-do-i-add-a-kernel-boot-parameter>:
 nouveau.modeset=0 nouveau.runpm=0 # force using integrated graphics
                                            # (the problem can be reproduced using NVIDIA's proprietary driver too, but I
                                            # guessed it's better to avoid it, and nouveau prints lots of errors with this GPU)
 mem_sleep_default=deep # suspend to RAM; suspend-to-idle has its own problems on this system

2. Run `dmesg -w` and wait a minute or two until a message like the following is printed:

 [ 175.611346] ucsi_ccg 0-0008: failed to reset PPM!
 [ 175.611355] ucsi_ccg 0-0008: PPM init failed (-110)

(attempting to suspend before the PPM init timeout will fail to enter sleep at all.)
(if your system doesn't report PPM init timeout, you probably won't see the hang on resume either)

3. Run `sudo pm-suspend` (using the power button to suspend causes other problems)

...wait for the laptop to go to sleep and the fans to turn off.

4. Press Enter on the built-in keyboard to resume. (Although the way we wake up the system doesn't seem to matter.)

5. Observe a hang lasting for almost a minute before the system is operational, with dmesg reporting:

 [ 299.331393] ata1.00: configured for UDMA/100
 <note the 47 second long gap>

 [ 346.133024] ucsi_ccg 0-0008: PPM NOT RESPONDING
 [ 346.133039] PM: dpm_run_callback(): ucsi_ccg_resume+0x0/0x20 [ucsi_ccg] returns -110
 [ 346.133042] PM: Device 0-0008 failed to resume: error -110
 ...
 [ 346.141504] Restarting tasks ... done.
 [ 346.340221] PM: suspend exit
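
The gap in the log above can be located automatically rather than by eye. Below is a minimal Python sketch, assuming dmesg lines in the `[ seconds.micros] message` form shown here; the function name `largest_gap` is hypothetical, and the embedded log is copied from the excerpt above. (The -110 in these messages corresponds to ETIMEDOUT on Linux.)

```python
import re

# Matches "[ 299.331393] message" with optional leading whitespace.
STAMP = re.compile(r"^\s*\[\s*(\d+\.\d+)\]\s*(.*)$")


def largest_gap(lines):
    """Return (gap_seconds, message_before, message_after) for the widest gap."""
    stamped = []
    for line in lines:
        m = STAMP.match(line)
        if m:
            stamped.append((float(m.group(1)), m.group(2)))
    best = None
    # Compare each timestamped line with the one that follows it.
    for (t0, msg0), (t1, msg1) in zip(stamped, stamped[1:]):
        if best is None or t1 - t0 > best[0]:
            best = (t1 - t0, msg0, msg1)
    return best


log = """\
 [ 299.331393] ata1.00: configured for UDMA/100
 [ 346.133024] ucsi_ccg 0-0008: PPM NOT RESPONDING
 [ 346.133039] PM: dpm_run_callback(): ucsi_ccg_resume+0x0/0x20 [ucsi_ccg] returns -110
 [ 346.141504] Restarting tasks ... done.
 [ 346.340221] PM: suspend exit
""".splitlines()

gap, before, after = largest_gap(log)
print(f"{gap:.1f}s between {before!r} and {after!r}")
```

Feeding it the output of `dmesg` captured after a resume should point at the same pair of messages if the hang is present.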

System info
===========

My Dell G3 3590 laptop has an NVIDIA "GeForce GTX 1660 Ti with Max-Q Design" GPU.
NVIDIA's "Turing" chips include a USB Type-C controller on the GPU (I read that future VR headsets are supposed to use it <https://github.com/envytools/envytools/search?q=4d151a19358579c77487ea3f72c32dc97c0250f7..ffd2dc9146482a5469209bbc861ed80adb066d31&type=Commits>), and indeed I'm seeing:

# lspci -tv
-[0000:00]-+-00.0 Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers
           +-01.0-[01]--+-00.0 NVIDIA Corporation TU116M [GeForce GTX 1660 Ti Mobile]
           |            +-00.1 NVIDIA Corporation Device 1aeb
           |            +-00.2 NVIDIA Corporation Device 1aec
           |            \-00.3 NVIDIA Corporation Device 1aed
...

Where the '1aed' device is detected as "NVIDIA USB Type-C Port Policy Controller" in Windows.

I'm not sure if it's serving any useful purpose on this laptop, and it certainly doesn't seem to function properly:

If I enable UCSI logging on startup (root's crontab):

 @reboot bash -c 'echo 1 > /sys/kernel/debug/tracing/events/ucsi/enable'

..the steps to reproduce above result in the following /sys/kernel/debug/tracing/trace:
# tracer: nop
#
# entries-in-buffer/entries-written: 10/10   #P:12
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
     kworker/6:2-679 [006] .... 68.593915: ucsi_command: control=00000001 (PPM_RESET)
     kworker/6:1-187 [006] .... 151.599387: ucsi_notify: CCI=00000000
     kworker/6:2-679 [006] .... 175.617158: ucsi_reset_ppm: PPM_RESET -> FAIL (err=-110)
     kworker/6:1-187 [006] .... 211.582572: ucsi_notify: CCI=00000000
     kworker/6:1-187 [006] .... 253.577823: ucsi_notify: CCI=00000000
     kworker/6:1-187 [006] .... 295.574520: ucsi_notify: CCI=00000000
      pm-suspend-3448 [007] .... 298.115894: ucsi_command: control=dbe70005 (SET_NOTIFICATION_ENABLE)
      pm-suspend-3448 [005] .... 346.138850: ucsi_run_command: SET_NOTIFICATION_ENABLE -> FAIL (err=-110)
     kworker/6:1-187 [006] .... 370.904651: ucsi_notify: CCI=00000000
     kworker/6:1-187 [006] .... 412.901709: ucsi_notify: CCI=00000000
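
The trace shows how long each UCSI command waited before failing. Below is a minimal Python sketch that pairs every `ucsi_command` entry with the later result line naming the same command and reports the elapsed time. It assumes the trace line layout shown above; the function name `command_durations` is hypothetical, and the embedded lines are a subset of the trace above:

```python
import re

# Captures "<timestamp>: <event>: <detail>" from an ftrace line.
EVENT = re.compile(r"\s(\d+\.\d+): (\w+): (.*)$")


def command_durations(trace_lines):
    """Return a list of (command, outcome, seconds_elapsed) tuples."""
    started = {}
    durations = []
    for line in trace_lines:
        m = EVENT.search(line)
        if not m:
            continue  # header or unrelated line
        ts, event, detail = float(m.group(1)), m.group(2), m.group(3)
        if event == "ucsi_command":
            # The command name is the parenthesised suffix, e.g. "(PPM_RESET)".
            name = re.search(r"\((\w+)\)\s*$", detail)
            if name:
                started[name.group(1)] = ts
        else:
            # Result lines look like "PPM_RESET -> FAIL (err=-110)".
            result = re.match(r"(\w+) -> (\w+)", detail)
            if result and result.group(1) in started:
                begin = started.pop(result.group(1))
                durations.append((result.group(1), result.group(2), ts - begin))
    return durations


trace = """\
     kworker/6:2-679 [006] .... 68.593915: ucsi_command: control=00000001 (PPM_RESET)
     kworker/6:1-187 [006] .... 151.599387: ucsi_notify: CCI=00000000
     kworker/6:2-679 [006] .... 175.617158: ucsi_reset_ppm: PPM_RESET -> FAIL (err=-110)
      pm-suspend-3448 [007] .... 298.115894: ucsi_command: control=dbe70005 (SET_NOTIFICATION_ENABLE)
      pm-suspend-3448 [005] .... 346.138850: ucsi_run_command: SET_NOTIFICATION_ENABLE -> FAIL (err=-110)
""".splitlines()

for command, outcome, seconds in command_durations(trace):
    print(f"{command}: {outcome} after {seconds:.1f}s")
```

The second duration is the one that lines up with the hang observed on resume.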

I updated the BIOS to the latest available version (08/28/2019) and installed (by booting into Windows) all the other updates available for this system from the vendor. I don't know how to check the firmware version of the USB-C chip on the GPU, or whether the chip even exists...

ProblemType: Bug
DistroRelease: Ubuntu 19.10
Package: linux-image-5.3.0-20-generic 5.3.0-20.21
ProcVersionSignature: Ubuntu 5.3.0-20.21-generic 5.3.7
Uname: Linux 5.3.0-20-generic x86_64
ApportVersion: 2.20.11-0ubuntu8
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: nickolay 1668 F.... pulseaudio
 /dev/snd/controlC0: nickolay 1668 F.... pulseaudio
CurrentDesktop: ubuntu:GNOME
Date: Tue Oct 29 01:21:28 2019
InstallationDate: Installed on 2019-10-20 (8 days ago)
InstallationMedia: Ubuntu 19.10 "Eoan Ermine" - Release amd64 (20191017)
MachineType: Dell Inc. G3 3590
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.3.0-20-generic root=UUID=0b40d72f-d832-47f6-ab77-faccfb6547fe ro nouveau.modeset=0 nouveau.runpm=0 mem_sleep_default=deep quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-5.3.0-20-generic N/A
 linux-backports-modules-5.3.0-20-generic N/A
 linux-firmware 1.183.1
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 08/28/2019
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.7.1
dmi.board.name: 061RYD
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 10
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.7.1:bd08/28/2019:svnDellInc.:pnG33590:pvr:rvnDellInc.:rn061RYD:rvrA00:cvnDellInc.:ct10:cvr:
dmi.product.family: GSeries
dmi.product.name: G3 3590
dmi.product.sku: 0949
dmi.sys.vendor: Dell Inc.

CVE References

Nickolay Ponomarev (asqueella) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: disco
Jeremy Sanders (jeremysanders) wrote :

Thanks for the diagnosis. This also happens for me on a desktop system with a Gigabyte GeForce RTX 2060 Gaming OC Pro 6G graphics card (rev 2.0), on a Gigabyte Z370P D3 motherboard. Blacklisting stops the problem.

Kai-Heng Feng (kaihengfeng) wrote :

This commit might help, please test it:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=71a1fa0df2a3728b8ccb97394be420d1f03df40e

Tim Sweeney (mainetim) wrote :

This is still present in the focal daily release.
5.4.0-14-generic #17-Ubuntu

tags: added: focal
Anshul Jethvani (polymorphis007) wrote :
Kai-Heng Feng (kaihengfeng) wrote :

Can you guys test this kernel without blacklisting ucsi_ccg:
https://people.canonical.com/~khfeng/lp1850238/

Nickolay Ponomarev (asqueella) wrote :

As the original reporter, I feel the obligation to reply: I'm not using the problematic laptop anymore, so if anyone else can test the proposed version - that would be most awesome!

kaihengfeng, thanks for taking a look! I'm curious: are you in contact with, or part of, the kernel USB team? Could you link to the commit the version you posted was built from? Testing new kernels is rather time-consuming for me due to lack of experience. Understanding the proposed patch and how it can fix the problem would make it much more fun (especially as the last time I checked, in December I think, the mainline kernel had regressed further on this hardware).

Kai-Heng Feng (kaihengfeng) wrote :

https://<email address hidden>/

description: updated
Changed in linux-oem-osp1 (Ubuntu Eoan):
status: New → Won't Fix
Changed in linux-oem-osp1 (Ubuntu Focal):
status: New → Won't Fix
Changed in linux (Ubuntu Bionic):
status: New → Won't Fix
Timo Aaltonen (tjaalton)
Changed in linux-oem-osp1 (Ubuntu Bionic):
status: New → Fix Committed
Nickolay Ponomarev (asqueella) wrote :

> https://<email address hidden>/

Much appreciated, thanks!

> Can you guys test this kernel without blacklisting ucsi_ccg: https://people.canonical.com/~khfeng/lp1850238/

I finally got to testing this on the laptop from the original bug report.

Both symptoms - "failed to reset PPM!" on startup and the 50-second hang during resume - are fixed, thank you!

Resuming from sleep often fails to complete on this system (the screen won't turn on, system not responding), but I haven't yet figured out how to reproduce this, and it also happens without the patch (e.g. https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.4-rc5/ ), so I guess it should be tracked separately.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu Eoan):
status: New → Confirmed
Changed in linux-oem-osp1 (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu Eoan):
status: Confirmed → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-eoan' to 'verification-done-eoan'. If the problem still exists, change the tag 'verification-needed-eoan' to 'verification-failed-eoan'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-eoan
tags: added: verification-done-eoan
removed: verification-needed-eoan
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-oem-osp1 - 5.0.0-1047.52

---------------
linux-oem-osp1 (5.0.0-1047.52) bionic; urgency=medium

  * bionic/linux-oem-osp1: 5.0.0-1047.52 -proposed tracker (LP: #1869351)

  * ucsi_ccg 50 second hang while resuming from s2ram with nvidia, recent
    kernels (LP: #1850238)
    - i2c: nvidia-gpu: Handle timeout correctly in gpu_i2c_check_status()

  * All PS/2 ports on PS/2 Serial add-in bracket are not working after S3
    (LP: #1866734)
    - Revert "UBUNTU: SAUCE: Input: i8042 - Fix the selftest retry logic"
    - SAUCE: Input: i8042 - fix the selftest retry logic

  [ Ubuntu: 5.0.0-45.49 ]

  * disco/linux: 5.0.0-45.49 -proposed tracker (LP: #1868954)
  * Missing wireless network interface after kernel 5.3.0-43 upgrade with eoan
    (LP: #1868442)
    - iwlwifi: mvm: Do not require PHY_SKU NVM section for 3168 devices

  [ Ubuntu: 5.0.0-44.48 ]

  * disco/linux: 5.0.0-44.48 -proposed tracker (LP: #1867284)
  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis
    - [Packaging] update helper scripts
  * Disco update: upstream stable patchset 2020-03-10 (LP: #1866858)
    - Revert "drm/sun4i: dsi: Change the start delay calculation"
    - ovl: fix lseek overflow on 32bit
    - kernel/module: Fix memleak in module_add_modinfo_attrs()
    - media: iguanair: fix endpoint sanity check
    - ocfs2: fix oops when writing cloned file
    - x86/cpu: Update cached HLE state on write to TSX_CTRL_CPUID_CLEAR
    - udf: Allow writing to 'Rewritable' partitions
    - printk: fix exclusive_console replaying
    - iwlwifi: mvm: fix NVM check for 3168 devices
    - sparc32: fix struct ipc64_perm type definition
    - cls_rsvp: fix rsvp_policy
    - gtp: use __GFP_NOWARN to avoid memalloc warning
    - l2tp: Allow duplicate session creation with UDP
    - net: hsr: fix possible NULL deref in hsr_handle_frame()
    - net_sched: fix an OOB access in cls_tcindex
    - net: stmmac: Delete txtimer in suspend()
    - bnxt_en: Fix TC queue mapping.
    - tcp: clear tp->total_retrans in tcp_disconnect()
    - tcp: clear tp->delivered in tcp_disconnect()
    - tcp: clear tp->data_segs{in|out} in tcp_disconnect()
    - tcp: clear tp->segs_{in|out} in tcp_disconnect()
    - rxrpc: Fix use-after-free in rxrpc_put_local()
    - rxrpc: Fix insufficient receive notification generation
    - rxrpc: Fix missing active use pinning of rxrpc_local object
    - rxrpc: Fix NULL pointer deref due to call->conn being cleared on disconnect
    - media: uvcvideo: Avoid cyclic entity chains due to malformed USB descriptors
    - mfd: dln2: More sanity checking for endpoints
    - ipc/msg.c: consolidate all xxxctl_down() functions
    - tracing: Fix sched switch start/stop refcount racy updates
    - rcu: Avoid data-race in rcu_gp_fqs_check_wake()
    - brcmfmac: Fix memory leak in brcmf_usbdev_qinit
    - usb: typec: tcpci: mask event interrupts when remove driver
    - usb: gadget: legacy: set max_speed to super-speed
    - usb: gadget: f_ncm: Use atomic_t to track in-flight request
    - usb: gadget: f_ecm: Use atomic_t to track in-flight request
    - ALSA: usb-audio: Fix endianess in descriptor validation
    - ALSA: dummy: Fix...

Changed in linux-oem-osp1 (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in linux-oem-osp1 (Ubuntu Eoan):
status: Won't Fix → Fix Released
Can Yildirim (canpy30) wrote :

I installed this kernel to check to see if the issue was fixed for me as well:
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.4.28/

I found that it did not solve the problem; I still get the long delay on resume:

```
[ 1097.833212] usb 1-7: reset high-speed USB device number 4 using xhci_hcd
[ 1142.739475] ucsi_ccg 0-0008: PPM NOT RESPONDING
[ 1142.739486] PM: dpm_run_callback(): ucsi_ccg_resume+0x0/0x20 [ucsi_ccg] returns -110
[ 1142.739489] PM: Device 0-0008 failed to resume: error -110
```

Does this kernel not include the fix? If not, which release version contains it?

Nickolay Ponomarev (asqueella) wrote :

I'm not very familiar with the release process, but judging from the tags on the accepted version of the fix https://github.com/torvalds/linux/commit/d944b27df121e2ee854a6c2fad13d6c6300792d4 the fix is in mainline v5.6 and later: https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.6/

Ubuntu's kernels seem to have backported it to:
https://launchpad.net/ubuntu/+source/linux/5.3.0-47.39 for eoan
https://launchpad.net/ubuntu/+source/linux/5.4.0-22.26 for focal

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.3.0-51.44

---------------
linux (5.3.0-51.44) eoan; urgency=medium

  * CVE-2020-11884
    - SAUCE: s390/mm: fix page table upgrade vs 2ndary address mode accesses

 -- Thadeu Lima de Souza Cascardo <email address hidden> Wed, 22 Apr 2020 17:35:41 -0300

Changed in linux (Ubuntu Eoan):
status: Fix Committed → Fix Released