amdgpu resume failure: failed to allocate wb slot

Bug #1825074 reported by You-Sheng Yang on 2019-04-17
This bug affects 1 person
Affects          Importance   Assigned to
HWE Next         Undecided    Unassigned
linux (Ubuntu)   Undecided    Unassigned
Bionic           Undecided    You-Sheng Yang

Bug Description

[Impact]
Systems with video cards using the amdgpu driver may fail to resume from suspend because of a resource leak.

[Fix]
73469585510d drm/amdgpu: fix&cleanups for wb_clear

[Test Case]
Verified with fwts for a thousand runs.

[Regression Risk]
Low. This patch has been included in stable kernels from v4.16.y onward, and
it is mostly a trivial bug fix.

==== Original Bug Report ====
[Summary]
When running the S3 stress test with an AMD RX550 installed, the system hung after resuming from S3 on the 112th cycle.

The kernel message:
[ 8120.977916] amdgpu 0000:01:00.0: (-22) failed to allocate wb slot
[ 8120.977941] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 11 (-22).
[ 8120.979662] [drm] ib test on ring 12 succeeded
[ 8120.981952] [drm] ib test on ring 13 succeeded
[ 8120.984578] [drm] ib test on ring 14 succeeded
[ 8120.984813] [drm] ib test on ring 15 succeeded
[ 8120.984825] [drm:amdgpu_device_resume [amdgpu]] *ERROR* ib ring test failed (-22).
[ 8120.997655] [drm] Type 1 DP-HDMI passive dongle 165Mhz:
[ 8121.022465] [drm] 92GH: [Block 0]
[ 8121.022465] [drm] 92GH: [Block 1]
[ 8121.022467] [drm] dc_link_detect: manufacturer_id = B838, product_id = 9202, serial_number = 1, manufacture_week = 29, manufacture_year = 18, display_name = 92GH, speaker_flag = 1, audio_mode_count = 1
[ 8121.022467] [drm] dc_link_detect: mode number = 0, format_code = 1, channel_count = 2, sample_rate = 7, sample_size = 7
[ 8121.022573] PM: resume of devices complete after 412.170 msecs
[ 8121.023076] acpi LNXPOWER:04: Turning OFF
[ 8121.023113] PM: Finishing wakeup.
[ 8121.023114] OOM killer enabled.
[ 8121.023114] Restarting tasks ...
[ 8121.023455] pci_bus 0000:04: Allocating resources
[ 8121.023471] pci 0000:03:00.0: bridge window [io 0x1000-0x0fff] to [bus 04] add_size 1000
[ 8121.023473] pci 0000:03:00.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 04] add_size 200000 add_align 100000
[ 8121.023474] pci 0000:03:00.0: bridge window [mem 0x00100000-0x000fffff] to [bus 04] add_size 200000 add_align 100000
[ 8121.023476] pci 0000:03:00.0: BAR 14: no space for [mem size 0x00200000]
[ 8121.023477] pci 0000:03:00.0: BAR 14: failed to assign [mem size 0x00200000]
[ 8121.023478] pci 0000:03:00.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[ 8121.023478] pci 0000:03:00.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[ 8121.023479] pci 0000:03:00.0: BAR 13: no space for [io size 0x1000]
[ 8121.023479] pci 0000:03:00.0: BAR 13: failed to assign [io size 0x1000]
[ 8121.023481] pci 0000:03:00.0: BAR 14: no space for [mem size 0x00200000]
[ 8121.023481] pci 0000:03:00.0: BAR 14: failed to assign [mem size 0x00200000]
[ 8121.023482] pci 0000:03:00.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[ 8121.023482] pci 0000:03:00.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[ 8121.023483] pci 0000:03:00.0: BAR 13: no space for [io size 0x1000]
[ 8121.023483] pci 0000:03:00.0: BAR 13: failed to assign [io size 0x1000]
[ 8121.023485] pci 0000:03:00.0: PCI bridge to [bus 04]
[ 8121.024358] done.
[ 8121.082344] video LNXVIDEO:00: Restoring backlight state
[ 8121.082346] PM: suspend exit
[ 8121.094634] IPv6: ADDRCONF(NETDEV_UP): eno1: link is not ready
[ 8121.112417] ata4: SATA link down (SStatus 4 SControl 300)
[ 8121.113212] ata3: SATA link down (SStatus 4 SControl 300)
[ 8121.113279] ata2: SATA link down (SStatus 4 SControl 300)
[ 8121.114133] ata1: SATA link down (SStatus 4 SControl 300)
[ 8121.192056] [drm] {1440x900, 1904x934@106500Khz}
[ 8121.282351] IPv6: ADDRCONF(NETDEV_UP): eno1: link is not ready
[ 8121.298481] amdgpu 0000:01:00.0: couldn't schedule ib on ring <sdma1>
[ 8121.298517] [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
[ 8121.298536] [drm:amd_sched_main [amdgpu]] *ERROR* Failed to run job!
[ 8122.183439] [drm] RC6 on
[ 8124.257908] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[ 8124.258035] IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
[ 8124.269506] amdgpu 0000:01:00.0: couldn't schedule ib on ring <sdma1>
[ 8124.269539] [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
[ 8124.269558] [drm:amd_sched_main [amdgpu]] *ERROR* Failed to run job!
[ 8125.089361] amdgpu 0000:01:00.0: couldn't schedule ib on ring <sdma1>
[ 8125.089429] [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
[ 8125.089448] [drm:amd_sched_main [amdgpu]] *ERROR* Failed to run job!
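When going through long dmesg captures from a stress run, it can help to pull out only the lines that carry the failure signature. The following is a minimal sketch, not part of the original report: the function name `find_amdgpu_failures` and the `SIGNATURES` tuple are hypothetical, and the substrings are simply copied from the log above.

```python
import re

# Matches dmesg-style lines such as
# "[ 8120.977916] amdgpu 0000:01:00.0: (-22) failed to allocate wb slot"
_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Substrings copied from the kernel log in this report (an assumption about
# which messages are worth flagging, not an exhaustive list).
SIGNATURES = (
    "failed to allocate wb slot",
    "ib ring test failed",
    "Error scheduling IBs",
)


def find_amdgpu_failures(log_text):
    """Return (timestamp, message) pairs for lines matching a signature."""
    hits = []
    for line in log_text.splitlines():
        match = _LINE.match(line.strip())
        if match and any(sig in match.group(2) for sig in SIGNATURES):
            hits.append((float(match.group(1)), match.group(2)))
    return hits
```

Passing the text of a saved dmesg file to `find_amdgpu_failures()` returns the matching lines in log order, so the first entry marks the cycle where allocation first failed.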

[Reproduce Steps]
1. apt-get install -y fwts
2. fwts s3 --s3-multiple=1000 --s3-min-delay=60 --s3-max-delay=60

[Results]
Expected: pass the S3 stress test
Actual: the system hung on the 112th S3 cycle

[Additional Information]
Kernel Version: 4.15.0-1035-oem
GPU: AMD RX550 (OPGA14) 1002:699f

You-Sheng Yang (vicamo) wrote :

commit 97407b63ea60 drm/amdgpu: use 256 bit buffers for all wb allocations (v2)
commit 63ae07ca4fb4 drm/amdgpu:fix wb_clear

These two commits introduced buggy resource management, which was later fixed in commit 73469585510d "drm/amdgpu: fix&cleanups for wb_clear".

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=97407b63ea60
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=63ae07ca4fb4
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=73469585510d

You-Sheng Yang (vicamo) wrote :

This only affects Bionic: Xenial does not include the two commits, and Cosmic/Disco already include the fix.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1825074

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Bionic):
status: New → Incomplete
You-Sheng Yang (vicamo) on 2019-04-17
tags: added: originate-from-1824453 somerville
Changed in linux (Ubuntu Bionic):
assignee: nobody → You-Sheng Yang (vicamo)
Changed in linux (Ubuntu):
status: Incomplete → In Progress
Changed in linux (Ubuntu Bionic):
status: Incomplete → In Progress
You-Sheng Yang (vicamo) wrote :

Attached a patch that dumps amdgpu_wb usage. It confirms that amdgpu_wb_free() is called with an offset returned from amdgpu_wb_get(), and yet it skips the actual release because the offset is larger than AMDGPU_MAX_WB.
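The leak described here can be illustrated with a toy model. This is only a sketch under stated assumptions, not the kernel code: the class `WbAllocator`, the helper `cycles_until_exhausted`, and the slot counts are hypothetical, and the scale factor of 8 dwords per slot is inferred from the "256 bit buffers" commit title. `num_slots` stands in for the AMDGPU_MAX_WB limit.

```python
class WbAllocator:
    """Toy model of a writeback-slot bitmap; NOT the amdgpu source."""

    SCALE = 8  # assumed: one 256-bit slot spans 8 dwords

    def __init__(self, num_slots, buggy):
        self.num_slots = num_slots  # stands in for AMDGPU_MAX_WB
        self.used = [False] * num_slots
        self.buggy = buggy

    def get(self):
        """Return the dword offset of the first free slot, or None if full."""
        for slot, taken in enumerate(self.used):
            if not taken:
                self.used[slot] = True
                return slot * self.SCALE
        return None

    def free(self, offset):
        """Release the slot behind a dword offset returned by get()."""
        if self.buggy:
            # Bug: the bounds check is applied to the scaled offset, so any
            # slot with index >= num_slots / SCALE is never released.
            if offset < self.num_slots:
                self.used[offset // self.SCALE] = False
        else:
            # Fixed: convert back to a slot index before the bounds check.
            slot = offset // self.SCALE
            if slot < self.num_slots:
                self.used[slot] = False


def cycles_until_exhausted(alloc, hold, limit=1000):
    """Run get/free cycles; return the 1-based cycle where get() first fails.

    `hold` slots are taken up front and kept, standing in for long-lived
    users. Returns None if no failure occurs within `limit` cycles.
    """
    for _ in range(hold):
        alloc.get()
    for cycle in range(1, limit + 1):
        offset = alloc.get()
        if offset is None:
            return cycle
        alloc.free(offset)
    return None
```

With 16 slots and 2 held, `cycles_until_exhausted(WbAllocator(16, buggy=True), hold=2)` fails on cycle 15 because every temporary slot leaks, while the `buggy=False` variant never runs out. The numbers are illustrative only and are not meant to reproduce the 112-cycle figure from the report.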

You-Sheng Yang (vicamo) on 2019-04-17
description: updated
tags: added: patch
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
You-Sheng Yang (vicamo) on 2019-04-30
tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-50.54

---------------
linux (4.15.0-50.54) bionic; urgency=medium

  * CVE-2018-12126 // CVE-2018-12127 // CVE-2018-12130
    - Documentation/l1tf: Fix small spelling typo
    - x86/cpu: Sanitize FAM6_ATOM naming
    - kvm: x86: Report STIBP on GET_SUPPORTED_CPUID
    - locking/atomics, asm-generic: Move some macros from <linux/bitops.h> to a
      new <linux/bits.h> file
    - tools include: Adopt linux/bits.h
    - x86/msr-index: Cleanup bit defines
    - x86/speculation: Consolidate CPU whitelists
    - x86/speculation/mds: Add basic bug infrastructure for MDS
    - x86/speculation/mds: Add BUG_MSBDS_ONLY
    - x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests
    - x86/speculation/mds: Add mds_clear_cpu_buffers()
    - x86/speculation/mds: Clear CPU buffers on exit to user
    - x86/kvm/vmx: Add MDS protection when L1D Flush is not active
    - x86/speculation/mds: Conditionally clear CPU buffers on idle entry
    - x86/speculation/mds: Add mitigation control for MDS
    - x86/speculation/mds: Add sysfs reporting for MDS
    - x86/speculation/mds: Add mitigation mode VMWERV
    - Documentation: Move L1TF to separate directory
    - Documentation: Add MDS vulnerability documentation
    - x86/speculation/mds: Add mds=full,nosmt cmdline option
    - x86/speculation: Move arch_smt_update() call to after mitigation decisions
    - x86/speculation/mds: Add SMT warning message
    - x86/speculation/mds: Fix comment
    - x86/speculation/mds: Print SMT vulnerable on MSBDS with mitigations off
    - x86/speculation/mds: Add 'mitigations=' support for MDS

  * CVE-2017-5715 // CVE-2017-5753
    - s390/speculation: Support 'mitigations=' cmdline option

  * CVE-2017-5715 // CVE-2017-5753 // CVE-2017-5754 // CVE-2018-3639
    - powerpc/speculation: Support 'mitigations=' cmdline option

  * CVE-2017-5715 // CVE-2017-5754 // CVE-2018-3620 // CVE-2018-3639 //
    CVE-2018-3646
    - cpu/speculation: Add 'mitigations=' cmdline option
    - x86/speculation: Support 'mitigations=' cmdline option

  * Packaging resync (LP: #1786013)
    - [Packaging] resync git-ubuntu-log

linux (4.15.0-49.53) bionic; urgency=medium

  * linux: 4.15.0-49.53 -proposed tracker (LP: #1826358)

  * Backport support for software count cache flush Spectre v2 mitigation. (CVE)
    (required for POWER9 DD2.3) (LP: #1822870)
    - powerpc/64s: Add support for ori barrier_nospec patching
    - powerpc/64s: Patch barrier_nospec in modules
    - powerpc/64s: Enable barrier_nospec based on firmware settings
    - powerpc: Use barrier_nospec in copy_from_user()
    - powerpc/64: Use barrier_nospec in syscall entry
    - powerpc/64s: Enhance the information in cpu_show_spectre_v1()
    - powerpc/64: Disable the speculation barrier from the command line
    - powerpc/64: Make stf barrier PPC_BOOK3S_64 specific.
    - powerpc/64: Add CONFIG_PPC_BARRIER_NOSPEC
    - powerpc/64: Call setup_barrier_nospec() from setup_arch()
    - powerpc/64: Make meltdown reporting Book3S 64 specific
    - powerpc/lib/code-patching: refactor patch_instruction()
    - powerpc/lib/feature-fixups: use raw_patch_instruction()
    - powerpc/asm: Add a patch_site mac...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for linux-aws has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
