Ubuntu18.04: GPU total memory is reduced

Bug #1792102 reported by bugproxy
Affects                           Status        Importance  Assigned to            Milestone
The Ubuntu-power-systems project  Fix Released  Critical    Canonical Kernel Team
linux (Ubuntu)                    Fix Released  Critical    Joseph Salisbury
Bionic                            Fix Released  Critical    Joseph Salisbury

Bug Description

== SRU Justification ==
Due to a recent change for powernv, the full GPU memory is no longer
available. This impacts performance for any application or benchmark that
has a large GPU memory utilization.

IBM is requesting mainline commit 7acf50e4efa6 in Bionic, which reverts
mainline commit 4b5d62ca17a1.

== Fix ==
7acf50e4efa6 ("Revert "powerpc/powernv: Increase memory block size to 1GB on radix"")

== Regression Potential ==
Low. This is a revert that was also done upstream due to a regression.
Limited to powerpc.

== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

== Comment: #0 - Michael Ranweiler <email address hidden> - 2018-09-10 19:26:14 ==

Due to a recent change for powernv, the full GPU memory is no longer available. This impacts performance for any application or benchmark that has a large GPU memory utilization.

Previous amount of memory : 16128MiB
Current amount of available memory : 15360MiB
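
The 768MiB gap between the two figures is what you would get if GPU memory were onlined only in whole memory blocks: 16128MiB is a multiple of 256MiB but not of 1GiB. A minimal sketch of that arithmetic, assuming (this report does not state it) that a partial trailing block is simply dropped; `usable_mib` is a hypothetical helper, not a kernel function:

```python
def usable_mib(total_mib: int, block_mib: int) -> int:
    """Round total_mib down to a whole number of memory blocks.

    Assumption: memory that does not fill a complete block is not onlined.
    """
    return (total_mib // block_mib) * block_mib


gpu_total_mib = 16128  # amount reported before the change

print(usable_mib(gpu_total_mib, 256))   # with 256MiB memory blocks
print(usable_mib(gpu_total_mib, 1024))  # with 1GiB memory blocks
```

Under that assumption the two calls give 16128 and 15360, matching the "previous" and "current" amounts above.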

From Anton, describing the recent change:
   powerpc/powernv: Increase memory block size to 1GB on radix

  Memory hot unplug on PowerNV radix hosts is broken. Our memory block
  size is 256MB but since we map the linear region with very large
  pages, each pte we tear down maps 1GB.

  A hot unplug of one 256MB memory block results in 768MB of memory
  getting unintentionally unmapped. At this point we are likely to oops.

  Fix this by increasing our memory block size to 1GB on PowerNV radix
  hosts.

  Fixes: 4b5d62ca17a1 ("powerpc/mm: add radix__remove_section_mapping()")
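
The 768MB figure in the commit message follows from the mismatch between block size and mapping size. A rough sketch of the arithmetic, assuming each teardown removes whole mapping-sized entries covering the block; `unintended_unmap_mb` is a hypothetical name used only for illustration:

```python
import math


def unintended_unmap_mb(block_mb: int, mapping_mb: int) -> int:
    """MB unmapped beyond the memory block that was actually unplugged.

    Assumption: tearing down a mapping always removes whole entries of
    mapping_mb each, enough to cover the unplugged block.
    """
    torn_down_mb = math.ceil(block_mb / mapping_mb) * mapping_mb
    return torn_down_mb - block_mb


print(unintended_unmap_mb(256, 1024))   # 256MB block under 1GB mappings
print(unintended_unmap_mb(1024, 1024))  # 1GB block under 1GB mappings
```

The first call gives 768 and the second gives 0, which is why raising the block size to 1GB avoids the stray unmapping.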

This is fixed with:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7acf50e4efa6

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-171272 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu):
status: New → In Progress
Changed in linux (Ubuntu Bionic):
status: New → In Progress
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with commit 7acf50e4efa6. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1792102

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is prior to 4.15 (Bionic), you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15 (Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.

Thanks in advance!

Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: Triaged → In Progress
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-09-25 15:52 EDT-------
I haven't been able to verify this because I haven't been able to build the Nvidia GPU module against this kernel. If I try to use dkms to build it I get:
/bin/sh: 1: scripts/basic/fixdep: Exec format error
scripts/Makefile.build:332: recipe for target '/var/lib/dkms/nvidia/410.37/build/nvidia/nv-mempool.o' failed
make[2]: *** [/var/lib/dkms/nvidia/410.37/build/nvidia/nv-mempool.o] Error 2
make[2]: *** Waiting for unfinished jobs....

This has worked fine on the kernels I've built locally as well as the standard (or the -proposed kernel put out), but this kernel doesn't work. I also tried the test kernels from lp1790636 since that had the same build host and I get the same failure with them.

Any ideas? Any other way to validate it?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-09-26 13:11 EDT-------
I had to rebuild the utilities in scripts since those were the problem: they were x86 binaries.
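
One way to confirm this kind of mismatch is to read the machine field from the ELF header of a helper such as scripts/basic/fixdep. The sketch below uses the standard ELF layout (data-encoding byte at offset 5, 16-bit e_machine at offset 18); `elf_machine` is a hypothetical helper and the name table covers only a few common values.

```python
import struct

# A few e_machine values from the ELF specification.
EM_NAMES = {3: "x86", 21: "ppc64", 62: "x86-64", 183: "aarch64"}


def elf_machine(header: bytes) -> str:
    """Name the target machine given the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_DATA: 1 means little-endian, 2 means big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return EM_NAMES.get(machine, "unknown ({})".format(machine))


# Example use on a real file (path taken from the error message above):
#   with open("scripts/basic/fixdep", "rb") as f:
#       print(elf_machine(f.read(20)))
```

On a ppc64le host, a result of "x86-64" for a build helper would explain an "Exec format error" like the one reported.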

The problem is fixed with this kernel:
user@deb3qwsp1:~$ cat /proc/version
Linux version 4.15.0-34-generic (jsalisbury@kathleen) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #38~lp1792102 SMP Wed Sep 12 19:55:58 UTC 2018
user@deb3qwsp1:~$ nvidia-smi |grep -A3 Memory
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 31C P0 34W / 300W | 0MiB / 16128MiB | 0% Default |

Joseph Salisbury (jsalisbury) wrote :
description: updated
Changed in ubuntu-power-systems:
importance: High → Critical
Manoj Iyer (manjo)
Changed in linux (Ubuntu):
importance: High → Critical
Changed in linux (Ubuntu Bionic):
importance: High → Critical
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Mike Ranweiler (mranweil) wrote :

I tested this against -proposed and it's fixed, thank you!

user@deb3qwsp1:~/gdrcopy$ cat /proc/version
Linux version 4.15.0-39-generic (buildd@bos02-ppc64el-016) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #42-Ubuntu SMP Tue Oct 23 15:41:45 UTC 2018
user@deb3qwsp1:~/gdrcopy$ nvidia-smi |grep -A3 Memory
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 31C P0 34W / 300W | 6MiB / 16128MiB | 0% Default |
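
For repeated checks, the total can be pulled out of the Memory-Usage column with a small parser. A minimal sketch, assuming the "used MiB / total MiB" layout shown in the output above; `parse_memory_usage` is a hypothetical helper, not part of any NVIDIA tool:

```python
import re
from typing import Optional, Tuple

# Matches the "<used>MiB / <total>MiB" pair in an nvidia-smi table row.
MEMORY_RE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def parse_memory_usage(line: str) -> Optional[Tuple[int, int]]:
    """Return (used_mib, total_mib) from one row, or None if absent."""
    match = MEMORY_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = "| N/A 31C P0 34W / 300W | 6MiB / 16128MiB | 0% Default |"
used_mib, total_mib = parse_memory_usage(sample)
print(used_mib, total_mib)
```

Comparing `total_mib` against the expected 16128 gives a quick pass/fail check for this bug.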

tags: added: verification-done-bionic
removed: verification-needed-bionic
Frank Heimes (fheimes)
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-39.42

---------------
linux (4.15.0-39.42) bionic; urgency=medium

  * linux: 4.15.0-39.42 -proposed tracker (LP: #1799411)

  * Linux: insufficient shootdown for paging-structure caches (LP: #1798897)
    - mm: move tlb_table_flush to tlb_flush_mmu_free
    - mm/tlb: Remove tlb_remove_table() non-concurrent condition
    - mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
    - [Config] CONFIG_HAVE_RCU_TABLE_INVALIDATE=y

  * Ubuntu18.04: GPU total memory is reduced (LP: #1792102)
    - Revert "powerpc/powernv: Increase memory block size to 1GB on radix"

  * arm64: snapdragon: reduce boot noise (LP: #1797154)
    - [Config] arm64: snapdragon: DRM_MSM=m
    - [Config] arm64: snapdragon: SND*=m
    - [Config] arm64: snapdragon: disable ARM_SDE_INTERFACE
    - [Config] arm64: snapdragon: disable DRM_I2C_ADV7511_CEC
    - [Config] arm64: snapdragon: disable VIDEO_ADV7511, VIDEO_COBALT

  * [Bionic] CPPC bug fixes (LP: #1796949)
    - ACPI / CPPC: Update all pr_(debug/err) messages to log the susbspace id
    - cpufreq: CPPC: Don't set transition_latency
    - ACPI / CPPC: Fix invalid PCC channel status errors

  * regression in 'ip --family bridge neigh' since linux v4.12 (LP: #1796748)
    - rtnetlink: fix rtnl_fdb_dump() for ndmsg header

  * screen displays abnormally on the lenovo M715 with the AMD GPU (Radeon Vega
    8 Mobile, rev ca, 1002:15dd) (LP: #1796786)
    - drm/amd/display: Fix takover from VGA mode
    - drm/amd/display: early return if not in vga mode in disable_vga
    - drm/amd/display: Refine disable VGA

  * arm64: snapdragon: WARNING: CPU: 0 PID: 1 arch/arm64/kernel/setup.c:271
    reserve_memblock_reserved_regions (LP: #1797139)
    - SAUCE: arm64: Fix /proc/iomem for reserved but not memory regions

  * The front MIC can't work on the Lenovo M715 (LP: #1797292)
    - ALSA: hda/realtek - Fix the problem of the front MIC on the Lenovo M715

  * Keyboard backlight sysfs sometimes is missing on Dell laptops (LP: #1797304)
    - platform/x86: dell-smbios: Correct some style warnings
    - platform/x86: dell-smbios: Rename dell-smbios source to dell-smbios-base
    - platform/x86: dell-smbios: Link all dell-smbios-* modules together
    - [Config] CONFIG_DELL_SMBIOS_SMM=y, CONFIG_DELL_SMBIOS_WMI=y

  * rpi3b+: ethernet not working (LP: #1797406)
    - lan78xx: Don't reset the interface on open

  * 87cdf3148b11 was never backported to 4.15 (LP: #1795653)
    - xfrm: Verify MAC header exists before overwriting eth_hdr(skb)->h_proto

  * [Ubuntu18.04][Power9][DD2.2]package installation segfaults inside debian
    chroot env in P9 KVM guest with HTM enabled (kvm) (LP: #1792501)
    - KVM: PPC: Book3S HV: Fix guest r11 corruption with POWER9 TM workarounds

  * Provide mode where all vCPUs on a core must be the same VM (LP: #1792957)
    - KVM: PPC: Book3S HV: Provide mode where all vCPUs on a core must be the same
      VM

  * fscache: bad refcounting in fscache_op_complete leads to OOPS (LP: #1797314)
    - SAUCE: fscache: Fix race in decrementing refcount of op->npages

  * CVE-2018-9363
    - Bluetooth: hidp: buffer overflow in hidp_process_report

  * CVE-20...


Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
bugproxy (bugproxy)
tags: added: targetmilestone-inin18041
removed: targetmilestone-inin---
Brad Figg (brad-figg)
tags: added: cscc