dGPU suspend fails under memory pressure

Bug #2033327 reported by Pierre Equoy
This bug affects 2 people

Affects: Linux
  Status: New
  Importance: Unknown

Affects: linux (Ubuntu)
  Status: Triaged
  Importance: Wishlist
  Assigned to: Unassigned

Bug Description

Hardware:

- Desktop PC (Intel i5 4th gen)
- GPU: AMD RX580
- Ubuntu 22.04 with kernel HWE 6.2

Steps to reproduce:

1. Start the device
2. Suspend
3. Resume by pressing a key on the keyboard

Expected results:

The screen lights up and shows the lock screen, where I enter my password to resume the session.

Actual result:

The screen OSD displays a "No Signal" message before turning off. Nothing seems to happen when I press keyboard keys or move the mouse.

Switching to a TTY by pressing Ctrl+Alt+F3 works: I can login and see the logs (that's how I filed this bug).

Switching back to the graphical session (Ctrl+Alt+F2) doesn't work: it shows the cursor and nothing else (black screen).

Switching to the login screen (Ctrl+Alt+F1) works: I can see the list of users, select one and enter my password... but then I'm back at the same black screen with the mouse cursor.

This is not a regression, as the same problem happened already with the previous point release (I believe it was Linux kernel 5.15).

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: linux-image-6.2.0-26-generic 6.2.0-26.26~22.04.1
ProcVersionSignature: Ubuntu 6.2.0-26.26~22.04.1-generic 6.2.13
Uname: Linux 6.2.0-26-generic x86_64
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
CasperMD5CheckResult: pass
Date: Tue Aug 29 09:15:49 2023
InstallationDate: Installed on 2022-12-28 (243 days ago)
InstallationMedia: Ubuntu 22.04.1 LTS "Jammy Jellyfish" - Release amd64 (20220809.1)
SourcePackage: linux-signed-hwe-6.2
UpgradeStatus: No upgrade log present (probably fresh install)
---
ProblemType: Bug
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC3: pieq 2262 F.... pulseaudio
 /dev/snd/controlC1: pieq 2262 F.... pulseaudio
 /dev/snd/controlC0: pieq 2262 F.... pulseaudio
 /dev/snd/controlC2: pieq 2262 F.... pulseaudio
CRDA: N/A
CasperMD5CheckResult: pass
CurrentDesktop: ubuntu:GNOME
DistroRelease: Ubuntu 22.04
InstallationDate: Installed on 2022-12-28 (245 days ago)
InstallationMedia: Ubuntu 22.04.1 LTS "Jammy Jellyfish" - Release amd64 (20220809.1)
IwConfig:
 lo no wireless extensions.

 enp3s0 no wireless extensions.
MachineType: ASUS All Series
Package: linux (not installed)
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-6.2.0-26-generic root=UUID=40d932e0-3982-4e4a-a5e0-5c5704eca928 ro quiet splash vt.handoff=7
ProcVersionSignature: Ubuntu 6.2.0-26.26~22.04.1-generic 6.2.13
RebootRequiredPkgs: Error: path contained symlinks.
RelatedPackageVersions:
 linux-restricted-modules-6.2.0-26-generic N/A
 linux-backports-modules-6.2.0-26-generic N/A
 linux-firmware 20220329.git681281e4-0ubuntu3.17
RfKill:

StagingDrivers: r8188eu
Tags: jammy staging
Uname: Linux 6.2.0-26-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip lpadmin lxd plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 03/26/2018
dmi.bios.release: 4.6
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 3602
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: B85M-K
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev X.0x
dmi.chassis.asset.tag: Asset-1234567890
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr3602:bd03/26/2018:br4.6:svnASUS:pnAllSeries:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnB85M-K:rvrRevX.0x:cvnChassisManufacture:ct3:cvrChassisVersion:skuAll:
dmi.product.family: ASUS MB
dmi.product.name: All Series
dmi.product.sku: All
dmi.product.version: System Version
dmi.sys.vendor: ASUS

Pierre Equoy (pieq) wrote :

Attached are the logs for the last boot. As you can see, the device had been running for a week without any problem, but suspending/resuming killed it.

We can see the following stack trace in the logs:

Aug 29 09:12:32 coltrane kernel: kworker/u8:24: page allocation failure: order:5, mode:0x40d00(GFP_NOIO|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
Aug 29 09:12:32 coltrane kernel: CPU: 3 PID: 903164 Comm: kworker/u8:24 Not tainted 6.2.0-26-generic #26~22.04.1-Ubuntu
Aug 29 09:12:32 coltrane kernel: Hardware name: ASUS All Series/B85M-K, BIOS 3602 03/26/2018
Aug 29 09:12:32 coltrane kernel: Workqueue: events_unbound async_run_entry_fn
Aug 29 09:12:32 coltrane kernel: Call Trace:
Aug 29 09:12:32 coltrane kernel: <TASK>
Aug 29 09:12:32 coltrane kernel: dump_stack_lvl+0x48/0x70
Aug 29 09:12:32 coltrane kernel: dump_stack+0x10/0x20
Aug 29 09:12:32 coltrane kernel: warn_alloc+0x14b/0x1c0
Aug 29 09:12:32 coltrane kernel: ? __alloc_pages_direct_compact+0xa7/0x240
Aug 29 09:12:32 coltrane kernel: __alloc_pages_slowpath.constprop.0+0x910/0x990
Aug 29 09:12:32 coltrane kernel: __alloc_pages+0x32c/0x360
Aug 29 09:12:32 coltrane kernel: __kmalloc_large_node+0x89/0x170
Aug 29 09:12:32 coltrane kernel: kmalloc_large+0x21/0xc0
Aug 29 09:12:32 coltrane kernel: dc_set_power_state+0x49/0x190 [amdgpu]
Aug 29 09:12:32 coltrane kernel: dm_suspend+0x9f/0x290 [amdgpu]
Aug 29 09:12:32 coltrane kernel: ? vi_common_set_clockgating_state+0xd7/0x190 [amdgpu]
Aug 29 09:12:32 coltrane kernel: amdgpu_device_ip_suspend_phase1+0xb9/0x1d0 [amdgpu]
Aug 29 09:12:32 coltrane kernel: amdgpu_device_suspend+0xca/0x1d0 [amdgpu]
Aug 29 09:12:32 coltrane kernel: amdgpu_pmops_suspend+0x33/0x50 [amdgpu]
Aug 29 09:12:32 coltrane kernel: pci_pm_suspend+0x8a/0x1c0
Aug 29 09:12:32 coltrane kernel: ? __pfx_pci_pm_suspend+0x10/0x10
Aug 29 09:12:32 coltrane kernel: dpm_run_callback+0x54/0x1a0
Aug 29 09:12:32 coltrane kernel: __device_suspend+0x14b/0x400
Aug 29 09:12:32 coltrane kernel: async_suspend+0x1f/0x80
Aug 29 09:12:32 coltrane kernel: async_run_entry_fn+0x33/0x130
Aug 29 09:12:32 coltrane kernel: process_one_work+0x21f/0x440
Aug 29 09:12:32 coltrane kernel: worker_thread+0x50/0x3f0
Aug 29 09:12:32 coltrane kernel: ? __pfx_worker_thread+0x10/0x10
Aug 29 09:12:32 coltrane kernel: kthread+0xee/0x120
Aug 29 09:12:32 coltrane kernel: ? __pfx_kthread+0x10/0x10
Aug 29 09:12:32 coltrane kernel: ret_from_fork+0x2c/0x50
Aug 29 09:12:32 coltrane kernel: </TASK>
Aug 29 09:12:32 coltrane kernel: Mem-Info:
Aug 29 09:12:32 coltrane kernel: active_anon:6 inactive_anon:2748750 isolated_anon:0
                                  active_file:309 inactive_file:232 isolated_file:0
                                  unevictable:41 dirty:1 writeback:8
                                  slab_reclaimable:60650 slab_unreclaimable:80021
                                  mapped:1 shmem:70425 pagetables:29671
                                  sec_pagetables:0 bounce:0
                                  kernel_misc_reclaimable:0
                                  free:106406 free_pcp:0 free_cma:0
Aug 29 09:12:32 coltrane kernel: Node 0 active_anon:24kB inactive_anon:10995000...
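A note for reading the first line of that trace: `order:5` means the kernel asked for 2^5 physically contiguous pages, and `GFP_NOIO` means the allocator was not allowed to start I/O while trying to reclaim memory. The sketch below is only an illustration of that arithmetic, not part of the report. It assumes 4 KiB pages (the usual x86_64 size, which appears consistent with the `inactive_anon` figures above, although that line is truncated), and the helper names are made up.

```python
import re

PAGE_SIZE_KIB = 4  # assumption: 4 KiB pages, the usual x86_64 page size


def parse_alloc_failure(line):
    """Extract (order, list of GFP flags) from a 'page allocation failure' line."""
    m = re.search(
        r"page allocation failure: order:(\d+), mode:0x[0-9a-f]+\(([^)]*)\)", line
    )
    if m is None:
        return None
    return int(m.group(1)), m.group(2).split("|")


def order_to_kib(order, page_kib=PAGE_SIZE_KIB):
    """An order-N request asks for 2**N physically contiguous pages."""
    return (1 << order) * page_kib


def pages_to_mib(pages, page_kib=PAGE_SIZE_KIB):
    """Convert a page count from the Mem-Info block to MiB."""
    return pages * page_kib / 1024


line = (
    "Aug 29 09:12:32 coltrane kernel: kworker/u8:24: page allocation failure: "
    "order:5, mode:0x40d00(GFP_NOIO|__GFP_COMP|__GFP_ZERO), "
    "nodemask=(null),cpuset=/,mems_allowed=0"
)
order, flags = parse_alloc_failure(line)
print(order, flags)                    # 5 ['GFP_NOIO', '__GFP_COMP', '__GFP_ZERO']
print(order_to_kib(order))             # 128 (KiB, contiguous)
print(round(pages_to_mib(106406), 1))  # 415.6 (MiB reported as free:106406)
```

Under those assumptions the failed request was for a single 128 KiB contiguous block while roughly 415 MiB was reported free in total, which would fit fragmentation or watermark limits rather than a complete lack of free pages. That reading is a guess from the numbers, not something the log states.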

Pierre Equoy (pieq) wrote :

Attaching GPU-related logs after rebooting the device.

Outputs from:

modinfo amdgpu > modinfo.amdgpu.log
sudo lshw > lshw.log
sudo dmidecode > dmidecode.log
lspci -nn > lspci.nn.log
lspci -vnn > lspci.vnn.log
cp /var/log/Xorg.0.log .
lsmod | grep amdgpu > lsmod.amdgpu.log

Timo Aaltonen (tjaalton)
affects: linux-signed-hwe-6.2 (Ubuntu) → linux (Ubuntu)
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 2033327

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Juerg Haefliger (juergh) wrote : Re: Desktop cannot resume after suspend: screen not detected

Aug 29 09:12:32 coltrane kernel: ------------[ cut here ]------------
Aug 29 09:12:32 coltrane kernel: WARNING: CPU: 3 PID: 903164 at drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:4277 dc_set_power_state+0x186/0x190 [amdgpu]
Aug 29 09:12:32 coltrane kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs snd_seq_dummy uas usb_storage vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel sunrpc kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel mei_hdcp mei_pxp sha512_ssse3 aesni_intel binfmt_misc crypto_simd amdgpu cryptd nls_iso8859_1 rapl snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi intel_cstate uvcvideo videobuf2_vmalloc snd_hda_intel videobuf2_memops videobuf2_v4l2 snd_intel_dspcfg snd_intel_sdw_acpi videodev snd_hda_codec eeepc_wmi snd_usb_audio videobuf2_common wmi_bmof input_leds snd_hda_core snd_usbmidi_lib mc snd_hwdep joydev snd_seq_midi snd_pcm snd_seq_midi_event at24 iommu_v2 drm_buddy gpu_sched drm_ttm_helper snd_rawmidi ttm drm_display_helper snd_seq cec cmdlinepart spi_nor rc_core snd_seq_device drm_kms_helper mtd
Aug 29 09:12:32 coltrane kernel: i2c_algo_bit snd_timer syscopyarea snd sysfillrect sysimgblt soundcore mei_me mei mac_hid sch_fq_codel msr parport_pc ppdev ramoops pstore_blk lp reed_solomon pstore_zone drm parport efi_pstore ip_tables x_tables autofs4 hid_generic usbhid hid mfd_aaeon asus_wmi ledtrig_audio spi_intel_platform sparse_keymap spi_intel platform_profile r8169 ahci crc32_pclmul realtek i2c_i801 i2c_smbus libahci lpc_ich xhci_pci xhci_pci_renesas video wmi
Aug 29 09:12:32 coltrane kernel: CPU: 3 PID: 903164 Comm: kworker/u8:24 Not tainted 6.2.0-26-generic #26~22.04.1-Ubuntu
Aug 29 09:12:32 coltrane kernel: Hardware name: ASUS All Series/B85M-K, BIOS 3602 03/26/2018
Aug 29 09:12:32 coltrane kernel: Workqueue: events_unbound async_run_entry_fn
Aug 29 09:12:32 coltrane kernel: RIP: 0010:dc_set_power_state+0x186/0x190 [amdgpu]
Aug 29 09:12:32 coltrane kernel: Code: bc 24 98 f4 01 00 49 8b 84 24 80 f3 01 00 49 8d 94 24 98 03 00 00 4c 89 e6 e8 06 0f 37 cb e9 47 ff ff ff 0f 0b e9 b4 fe ff ff <0f> 0b e9 39 ff ff ff 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90
Aug 29 09:12:32 coltrane kernel: RSP: 0018:ffffb9b98bde3c40 EFLAGS: 00010246
Aug 29 09:12:32 coltrane kernel: RAX: 0000000000000000 RBX: 0000000000000006 RCX: 0000000000000000
Aug 29 09:12:32 coltrane kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Aug 29 09:12:32 coltrane kernel: RBP: ffffb9b98bde3c60 R08: 0000000000000000 R09: 0000000000000000
Aug 29 09:12:32 coltrane kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff922043d60000
Aug 29 09:12:32 coltrane kernel: R13: 0000000000000000 R14: ffffffffc126a410 R15: ffff9220537e0000
Aug 29 09:12:32 coltrane kernel: FS: 0000000000000000(0000) GS:ffff92234ed80000(0000) knlGS:0000000000000000
Aug 29 09:12:32 coltrane kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 09:12:32 coltrane kernel: CR2: 00005602366bd746 CR3: 0000000180210006 CR4: 00000000001706e0
Aug 29 09:12:3...


Mario Limonciello (superm1) wrote :

I don't see the full logs for your failure in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2033327/+attachment/5695799/+files/journalctl-b0-20230829.log, I'm interested in the stuff outside of the trace itself.

But I /suspect/ from that trace alone that you were out of memory at suspend time. This is supposed to be prevented by '8d4de331f1b2 ("drm/amd: Fail the suspend if resources can't be evicted")', but if that's indeed the cause, I need to see the rest of the context to explain it.

Pierre Equoy (pieq) attached the following apport information, one file per comment:

AlsaInfo.txt, CurrentDmesg.txt, Lspci.txt, Lspci-vt.txt, Lsusb.txt, Lsusb-t.txt, Lsusb-v.txt, PaInfo.txt, ProcCpuinfo.txt, ProcCpuinfoMinimal.txt, ProcEnviron.txt, ProcInterrupts.txt, ProcModules.txt, PulseList.txt, UdevDb.txt, WifiSyslog.txt, acpidump.txt

tags: added: apport-collected staging
description: updated

Pierre Equoy (pieq) wrote : Re: Desktop cannot resume after suspend: screen not detected

@superm1

Thanks for your feedback! Here is the full journal from that boot.

Edit: ah crap, it's the same file, I thought I had originally sent an edited version with just the stack trace... not sure what else to provide.

Mario Limonciello (superm1) wrote :

Can you just manually fetch the journal from the failed boot and attach that?

Pierre Equoy (pieq) wrote :

That's what I attached. When I could not get back to a proper graphical environment, I switched to a TTY, ran `sudo journalctl -b0 > journalctl-b0-20230829.log`, and I attached it to that bug. As you can see in the logs, the device had been running since August 21, and on Aug 28 evening I suspended it:

Aug 28 19:34:36 coltrane systemd[1]: Reached target Sleep.
Aug 28 19:34:36 coltrane systemd[1]: Starting Record successful boot for GRUB...
Aug 28 19:34:36 coltrane systemd[1]: Starting Restart Syncthing after resume...
Aug 28 19:34:36 coltrane systemd[1]: Starting System Suspend...
Aug 28 19:34:36 coltrane systemd-sleep[903135]: Entering sleep state 'suspend'...
Aug 28 19:34:36 coltrane kernel: PM: suspend entry (deep)
Aug 28 19:34:36 coltrane systemd[1]: grub-common.service: Deactivated successfully.
Aug 28 19:34:36 coltrane systemd[1]: Finished Record successful boot for GRUB.
Aug 28 19:34:36 coltrane systemd[1]: Starting GRUB failed boot detection...

and then resumed it on Aug 29 at 09:12, and this is when the stack trace appears:

Aug 29 09:12:26 coltrane kernel: Filesystems sync: 0.018 seconds
Aug 29 09:12:31 coltrane kernel: Freezing user space processes
Aug 29 09:12:31 coltrane kernel: Freezing user space processes completed (elapsed 0.003 seconds)
Aug 29 09:12:31 coltrane kernel: OOM killer disabled.
Aug 29 09:12:32 coltrane kernel: Freezing remaining freezable tasks
Aug 29 09:12:32 coltrane kernel: Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
Aug 29 09:12:32 coltrane kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Aug 29 09:12:32 coltrane kernel: sd 1:0:0:0: [sda] Synchronizing SCSI cache
Aug 29 09:12:32 coltrane kernel: sd 1:0:0:0: [sda] Stopping disk
Aug 29 09:12:32 coltrane kernel: sd 2:0:0:0: [sdb] Synchronizing SCSI cache
Aug 29 09:12:32 coltrane kernel: sd 3:0:0:0: [sdc] Synchronizing SCSI cache
Aug 29 09:12:32 coltrane kernel: sd 4:0:0:0: [sdd] Synchronizing SCSI cache
Aug 29 09:12:32 coltrane kernel: sd 2:0:0:0: [sdb] Stopping disk
Aug 29 09:12:32 coltrane kernel: sd 4:0:0:0: [sdd] Stopping disk
Aug 29 09:12:32 coltrane kernel: sd 3:0:0:0: [sdc] Stopping disk
(...)
Aug 29 09:12:32 coltrane kernel: kworker/u8:24: page allocation failure: order:5, mode:0x40d00(GFP_NOIO|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
Aug 29 09:12:32 coltrane kernel: CPU: 3 PID: 903164 Comm: kworker/u8:24 Not tainted 6.2.0-26-generic #26~22.04.1-Ubuntu
Aug 29 09:12:32 coltrane kernel: Hardware name: ASUS All Series/B85M-K, BIOS 3602 03/26/2018
Aug 29 09:12:32 coltrane kernel: Workqueue: events_unbound async_run_entry_fn
Aug 29 09:12:32 coltrane kernel: Call Trace:
Aug 29 09:12:32 coltrane kernel: <TASK>
Aug 29 09:12:32 coltrane kernel: dump_stack_lvl+0x48/0x70
Aug 29 09:12:32 coltrane kernel: dump_stack+0x10/0x20
Aug 29 09:12:32 coltrane kernel: warn_alloc+0x14b/0x1c0
Aug 29 09:12:32 coltrane kernel: ? __alloc_pages_direct_compact+0xa7/0x240
Aug 29 09:12:32 coltrane kernel: __alloc_pages_slowpath.constprop.0+0x910/0x990
Aug 29 09:12:32 coltrane kernel: __alloc_pages+0x32c/0x360
Aug 29 09:12:32 coltrane kernel: __kmalloc_large_node+0x89/0x170
Aug 29 09:12:32 coltrane kernel...


Mario Limonciello (superm1) wrote :

I don't know if launchpad messed it up, but I'm not seeing anything past the 28th.

Pierre Equoy (pieq) wrote :

That's weird... I downloaded the last attachment I sent and checked it with vim and I can see past the 28th...

Anyway, here is another attempt. Same log, xzipped. I checked and the last line of the log is on Aug 29 09:26:34.
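
For anyone who wants to double-check how far an attached journal dump actually reaches, a small sketch like the following could report the first and last timestamps in a plain or xz-compressed log. It assumes the default `journalctl` short timestamp format seen in this report (`Aug 29 09:12:32 ...`); the function name is made up for illustration.

```python
import lzma
import re

# Matches the leading "Mon DD HH:MM:SS" stamp of a journalctl short-format line.
STAMP = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}(?= )")


def log_span(path):
    """Return (first_stamp, last_stamp) found in a plain or .xz log file."""
    opener = lzma.open if path.endswith(".xz") else open
    first = last = None
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = STAMP.match(line)
            if m:  # continuation lines (e.g. the Mem-Info block) are skipped
                last = m.group(0)
                if first is None:
                    first = last
    return first, last
```

Calling `log_span()` on the downloaded attachment should then show whether the copy being viewed really ends on the 28th or continues into the 29th.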

Pierre Equoy (pieq)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
assignee: nobody → Mario Limonciello (superm1)
Mario Limonciello (superm1) wrote :

Got it; so what happened is there is a separate malloc call elsewhere in the amdgpu display code that failed due to the low memory situation. Is this a "one time thing" or is this a regular occurrence that you can reproduce in how you use your machine? I would have to look at context to see if it made sense to reserve that memory all the time, but if this is the only place that explodes maybe we can store a local variable in the amdgpu device structure for the thing that is malloc'ed.

James Phillips (78luphr0rnk2nuqimstywepozxn9kl19tqh0tx66b5dki1xxsh5mkz9gl21a5rlwfnr8jn6ln0m3jxne2k9x1ohg85w3jabxlrqbgszpjpwcmvkbcvq9spp6z3w5j1m33k06t-launchpad-a811i2i3ytqlsztthjth0svbccw8inm65tmkqp9sarr553jq53in4xm1m8wn3o4rlwaer06ogwvqwv9mrqoku2x334n7di44o65qze67n1wneepmidnuwnde1rqcbpgdf70gt) wrote (last edit ):

Message #26:
"Aug 29 09:12:32 coltrane kernel: sd 2:0:0:0: [sdb] Stopping disk
Aug 29 09:12:32 coltrane kernel: sd 4:0:0:0: [sdd] Stopping disk
Aug 29 09:12:32 coltrane kernel: sd 3:0:0:0: [sdc] Stopping disk
(...)
Aug 29 09:12:32 coltrane kernel: kworker/u8:24: page allocation failure: order:5, mode:0x40d00(GFP_NOIO|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0"

Confirms that it is probably upstream bug number 2362
https://gitlab.freedesktop.org/drm/amd/-/issues/2362

The issue is that the GPU driver tries to dump the contents of VRAM to system RAM upon suspend. However, this cannot happen because the disks (with swap) are suspended first. It appears that the solution is likely to dump the VRAM contents during the "pre-suspend" hook. Incidentally, the bug appears to date back to at least 2010, but VRAM was a lot smaller compared to system RAM back then.

I got the same errors but was able to resume from suspend for some reason. I *suspect* it is because I am using the ROCm GPU drivers. Edit: Another non-standard thing I did was raise vm.min_free_kbytes to 393216 (384MiB - 64MiB per core) from the default of 64MiB. I also enabled zswap with lz4/z3fold (merging two different guides) and disabled deduplication (it made the system unusable under extreme swap testing with 'mprime'; with dedup disabled the system was kinda-sorta usable (still 80% waiting on disk) with over 75% of RAM in use).
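
To compare a machine's current setting with the value quoted above, something along these lines could be used. It is only a sketch: `/proc/sys/vm/min_free_kbytes` is the standard procfs location for this sysctl, while the helper names are invented here.

```python
def read_min_free_kib(path="/proc/sys/vm/min_free_kbytes"):
    """Return the current vm.min_free_kbytes value as an integer (KiB)."""
    with open(path) as fh:
        return int(fh.read().strip())


def kib_to_mib(kib):
    """Convert KiB to MiB."""
    return kib / 1024
```

For the figure in this comment, `kib_to_mib(393216)` gives 384.0, i.e. six times 64 MiB. On a Linux system, `kib_to_mib(read_min_free_kib())` shows the value currently in effect.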

James Phillips (78luphr0rnk2nuqimstywepozxn9kl19tqh0tx66b5dki1xxsh5mkz9gl21a5rlwfnr8jn6ln0m3jxne2k9x1ohg85w3jabxlrqbgszpjpwcmvkbcvq9spp6z3w5j1m33k06t-launchpad-a811i2i3ytqlsztthjth0svbccw8inm65tmkqp9sarr553jq53in4xm1m8wn3o4rlwaer06ogwvqwv9mrqoku2x334n7di44o65qze67n1wneepmidnuwnde1rqcbpgdf70gt) wrote (last edit ):

Kern.log with similar suspend messages. I was looking at the logs because my game of Euro Truck Simulator 2 appeared to crash, but in hindsight it was probably just the video driver that crashed. Steam was still running right up until reboot. The game log was truncated by 4 minutes, despite logging into a terminal and typing 'sync' before "systemctl reboot -i".

When the video driver crashed I was unable to use the desktop environment, but the display manager seemed to work.

$ uname -a
Linux cathy 5.15.0-87-generic #97-Ubuntu SMP Mon Oct 2 21:09:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

I am using the AMD ROCm 5.6 video drivers (the last ones supporting my Vega 56 card, apparently).

The actual video crash happens in this part of the log:
Oct 20 10:47:59 cathy kernel: [67779.641047] amdgpu 0000:06:00.0: amdgpu: IH ring buffer overflow (0x00080C60, 0x000002C0, 0x00000C80)
Oct 20 10:47:59 cathy kernel: [67779.753369] [drm] ring 0 timeout to preempt ib
Oct 20 10:48:09 cathy kernel: [67789.894431] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=1282275, emitted seq=1282276
Oct 20 10:48:09 cathy kernel: [67789.895531] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1994 thread gnome-shel:cs0 pid 2002

Edit: to add: I had spent 100+ hours "testing" the game over the past two weeks. The suspend-related error messages did not occur until I "fixed" the game crashing problem with a work-around where you pass the kernel module parameter amdgpu.noretry=0 on the kernel command line. I suspect the suspend problem manifested because I:
1. Played a 7+ hour marathon session to make sure the work-around worked (the GPU page faults still happen with the work-around: they are just of the form "retry page fault" instead of "no-retry page fault").
2. Let Steam run overnight with suspend inhibited. 2 game updates were installed during that time.

This served to deplete the available memory when I actually got around to suspending the machine.

Mario Limonciello (superm1) wrote :

You're totally right that there is a problem: when memory pressure is high, the suspend sequence can't evict VRAM.

This is upstream https://gitlab.freedesktop.org/drm/amd/-/issues/2362

A bunch of changes are going into 6.7 that will move the parts that allocate memory into the prepare() pmops sequence. This will "fix" the immediate problem in that the suspend itself will "return" an error code.

There is a secondary problem though that the PM core disables swap "too soon", so if memory is low it can't be moved into swap.

This secondary problem can be fixed by the PM core either moving the timing of the swap disable or evicting some other usage (userspace?) into swap before letting the prepare()/suspend() sequence start.
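
To illustrate why moving the allocation into prepare() changes the outcome, here is a toy model in Python. It is not kernel code and does not mirror the amdgpu implementation: `ToyMemoryPool`, `ToyDevice` and `toy_suspend` are hypothetical names, and the only behaviour modelled is that all prepare() callbacks run before any device is suspended, so a failure there aborts the transition cleanly.

```python
class ToyMemoryPool:
    """Hypothetical stand-in for the memory available at suspend time."""

    def __init__(self, free_kib):
        self.free_kib = free_kib

    def alloc(self, kib):
        if kib > self.free_kib:
            raise MemoryError(f"cannot allocate {kib} KiB")
        self.free_kib -= kib


class ToyDevice:
    """Hypothetical driver that needs scratch memory to save its state."""

    def __init__(self, name, scratch_kib, alloc_in_prepare):
        self.name = name
        self.scratch_kib = scratch_kib
        self.alloc_in_prepare = alloc_in_prepare

    def prepare(self, pool):
        if self.alloc_in_prepare:
            pool.alloc(self.scratch_kib)

    def suspend(self, pool):
        if not self.alloc_in_prepare:
            pool.alloc(self.scratch_kib)


def toy_suspend(devices, pool):
    """Return (result, names of devices that were already suspended)."""
    try:
        for dev in devices:  # phase 1: nothing is powered down yet
            dev.prepare(pool)
    except MemoryError:
        return "aborted in prepare", []
    done = []
    for dev in devices:  # phase 2: devices go down one by one
        try:
            dev.suspend(pool)
        except MemoryError:
            return "failed in suspend", done
        done.append(dev.name)
    return "suspended", done
```

With a pool of 64 KiB, a "disk" needing nothing and a "gpu" needing 128 KiB, allocating in suspend() yields `("failed in suspend", ["disk"])`, i.e. the failure is only noticed once other devices are already down. Allocating in prepare() yields `("aborted in prepare", [])`, which corresponds to the suspend attempt returning an error before anything has been suspended.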

Changed in linux (Ubuntu):
importance: Undecided → Wishlist
summary: - Desktop cannot resume after suspend: screen not detected
+ dGPU suspend fails under memory pressure
Changed in linux (Ubuntu):
status: Confirmed → Triaged
assignee: Mario Limonciello (superm1) → nobody
Changed in linux:
status: Unknown → New