performance drop with ATS enabled

Bug #1788097 reported by bugproxy
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
The Ubuntu-power-systems project
Fix Released
High
Canonical Kernel Team
linux (Ubuntu)
Fix Released
High
Joseph Salisbury
Bionic
Fix Released
High
Joseph Salisbury

Bug Description

== Comment: #0 - Michael Ranweiler <email address hidden> - 2018-08-16 09:58:02 ==

Witherspoon cluster now has ATS enabled with driver 396.42, CUDA version 9.2.148. They are running the CORAL benchmark LULESH with and without ATS, and they see a significant performance drop with ATS enabled.

========
Below is the run with ATS:

Run completed:
Problem size = 160
MPI tasks = 8
Iteration count = 100
Final Origin Energy = 1.605234e+09
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 2.384186e-07
TotalAbsDiff = 5.300015e-07
MaxRelDiff = 1.631916e-12

Elapsed time = 153.00 (s)
Grind time (us/z/c) = 0.37352393 (per dom) (0.046690491 overall)
FOM = 21417.637 (z/s)

========
Here is the run without ATS:
Run completed:
Problem size = 160
MPI tasks = 8
Iteration count = 100
Final Origin Energy = 1.605234e+09
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 2.384186e-07
TotalAbsDiff = 5.300015e-07
MaxRelDiff = 1.631916e-12

Elapsed time = 13.27 (s)
Grind time (us/z/c) = 0.032394027 (per dom) (0.0040492534 overall)
FOM = 246959.11 (z/s)
========

Using ATS on a single node slows down the OpenACC version more than 10 times, and for the version with OpenMP 4.5 and managed memory, they observe a 2x slowdown.

Last comment from NVIDIA (Javier Cabezas - 07/29/2018 11:30 AM):
We think we have found where's the issue.

This behavior reproduces for any two concurrent processes that create CUDA contexts on the GPUs and heavily unmap memory (no need to launch any work on the GPUs). When the problem repros, perf shows that most of the time is spent in mmio_invalidate. However, this only happens when processes register GPUs attached to the same NPU. Thus, if process A, initializes GPU 0 and/or 1, and process B, initializes GPU 2 and/or 3, we don't see the slowdown. This makes sense, because ATSDs on different NPUs are issued independently.

After some code inspection in npu-dma.c (powerpc backend in the Linux kernel), Mark noticed that the problem could be in the utilization of test_and_set_bit_lock in get_mmio_atsd_reg. The implementation of test_and_set_bit_lock in powerpc relies on the ldarx/stdcx instructions (PPC_LLARX/PPC_STLCX in the snippet below):

#define DEFINE_TESTOP(fn, op, prefix, postfix, eh) \
static __inline__ unsigned long fn( \
        unsigned long mask, \
        volatile unsigned long *_p) \
{ \
    unsigned long old, t; \
    unsigned long *p = (unsigned long *)_p; \
    __asm__ __volatile__ ( \
    prefix \
"1:" PPC_LLARX(%0,0,%3,eh) "\n" \
    stringify_in_c(op) "%1,%0,%2\n" \
    PPC405_ERR77(0,%3) \
    PPC_STLCX "%1,0,%3\n" \
    "bne- 1b\n" \
    postfix \
    : "=&r" (old), "=&r" (t) \
    : "r" (mask), "r" (p) \
    : "cc", "memory"); \
    return (old & mask); \
}

According to the PowerPC manual, ldarx creates a memory reservation and a subsequent stwcx instruction from the same processor ensures an atomic read-modify-write operation. However, the reservation can be lost if a different processor executes any store instruction on the same address. That's why "bne- 1b" checks wether stwcx was successful and jumps back to retry, otherwise. Since DEFINE_TESTOP doesn't implement any back-off mechanism, two different processors trying to get an ATSD register can starve each other.

Mark compiled a custom kernel which surrounds the calls to test_and_set_bit_lock in get_mmio_atsd_reg with a spinlock and I verified that it solves the issue. These are the execution times for LULESH:

ATS OFF
Elapsed time = 16.87 (s)

ATS ON
Elapsed time = 215.56 (s)

ATS ON + Spinlock
Elapsed time = 18.14 (s)

Fixed with the following patch in the powerpc tree:
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=9eab9901b015

== Comment: #1 - Michael Ranweiler <email address hidden> - 2018-08-20 14:56:52 ==
This is now in mainline, too:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/powerpc/platforms/powernv/npu-dma.c?id=9eab9901b015f489199105c470de1ffc337cfabb

It has some small fuzz to apply to 4.15.0.32-35:

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 6c8e168e6571..18226895681e 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -434,8 +434,9 @@ static int get_mmio_atsd_reg(struct npu *npu)
        int i;

        for (i = 0; i < npu->mmio_atsd_count; i++) {
- if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
- return i;
+ if (!test_bit(i, &npu->mmio_atsd_usage))
+ if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
+ return i;
        }

        return -ENOSPC;

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-170624 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Changed in ubuntu-power-systems:
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: triage-g
Changed in ubuntu-power-systems:
importance: Undecided → High
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
Changed in linux (Ubuntu):
status: New → Triaged
Changed in linux (Ubuntu Bionic):
status: New → Triaged
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with commit 9eab9901b015. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1788097

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
* If the test kernel is prior to 4.15(Bionic) you need to install the linux-image and linux-image-extra .deb packages.
* If the test kernel is 4.15(Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.

Thanks in advance!

Changed in linux (Ubuntu):
status: Triaged → In Progress
Changed in linux (Ubuntu Bionic):
status: Triaged → In Progress
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → In Progress
Seth Forshee (sforshee)
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-09-02 13:33 EDT-------
Kris Murphy recreated the ATS enabled driver performance issue with 18.04.1 and then loaded the test kernel from the bug below and verified that it fixes the issue.
Bz 170624 ? NV 2282038 performance drop with ATS enabled
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1788097

Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Revision history for this message
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Revision history for this message
Mike Ranweiler (mranweil) wrote :

We verified this with the imagenet data and the proposed kernel and it looks good. Thanks!

tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (23.5 KiB)

This bug was fixed in the package linux - 4.15.0-36.39

---------------
linux (4.15.0-36.39) bionic; urgency=medium

  * CVE-2018-14633
    - iscsi target: Use hex2bin instead of a re-implementation

  * CVE-2018-17182
    - mm: get rid of vmacache_flush_all() entirely

linux (4.15.0-35.38) bionic; urgency=medium

  * linux: 4.15.0-35.38 -proposed tracker (LP: #1791719)

  * device hotplug of vfio devices can lead to deadlock in vfio_pci_release
    (LP: #1792099)
    - SAUCE: vfio -- release device lock before userspace requests

  * L1TF mitigation not effective in some CPU and RAM combinations
    (LP: #1788563)
    - x86/speculation/l1tf: Fix overflow in l1tf_pfn_limit() on 32bit
    - x86/speculation/l1tf: Fix off-by-one error when warning that system has too
      much RAM
    - x86/speculation/l1tf: Increase l1tf memory limit for Nehalem+

  * CVE-2018-15594
    - x86/paravirt: Fix spectre-v2 mitigations for paravirt guests

  * CVE-2017-5715 (Spectre v2 s390x)
    - KVM: s390: implement CPU model only facilities
    - s390: detect etoken facility
    - KVM: s390: add etoken support for guests
    - s390/lib: use expoline for all bcr instructions
    - s390: fix br_r1_trampoline for machines without exrl
    - SAUCE: s390: use expoline thunks for all branches generated by the BPF JIT

  * Ubuntu18.04.1: cpuidle: powernv: Fix promotion from snooze if next state
    disabled (performance) (LP: #1790602)
    - cpuidle: powernv: Fix promotion from snooze if next state disabled

  * Watchdog CPU:19 Hard LOCKUP when kernel crash was triggered (LP: #1790636)
    - powerpc: hard disable irqs in smp_send_stop loop
    - powerpc: Fix deadlock with multiple calls to smp_send_stop
    - powerpc: smp_send_stop do not offline stopped CPUs
    - powerpc/powernv: Fix opal_event_shutdown() called with interrupts disabled

  * Security fix: check if IOMMU page is contained in the pinned physical page
    (LP: #1785675)
    - vfio/spapr: Use IOMMU pageshift rather than pagesize
    - KVM: PPC: Check if IOMMU page is contained in the pinned physical page

  * Missing Intel GPU pci-id's (LP: #1789924)
    - drm/i915/kbl: Add KBL GT2 sku
    - drm/i915/whl: Introducing Whiskey Lake platform
    - drm/i915/aml: Introducing Amber Lake platform
    - drm/i915/cfl: Add a new CFL PCI ID.

  * CVE-2018-15572
    - x86/speculation: Protect against userspace-userspace spectreRSB

  * Support Power Management for Thunderbolt Controller (LP: #1789358)
    - thunderbolt: Handle NULL boot ACL entries properly
    - thunderbolt: Notify userspace when boot_acl is changed
    - thunderbolt: Use 64-bit DMA mask if supported by the platform
    - thunderbolt: Do not unnecessarily call ICM get route
    - thunderbolt: No need to take tb->lock in domain suspend/complete
    - thunderbolt: Use correct ICM commands in system suspend
    - thunderbolt: Add support for runtime PM

  * random oopses on s390 systems using NVMe devices (LP: #1790480)
    - s390/pci: fix out of bounds access during irq setup

  * [Bionic] Spectre v4 mitigation (Speculative Store Bypass Disable) support
    for arm64 using SMC firmware call to set a hardware chicken bit
    (LP: #1787993) // CVE-2018...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (29.0 KiB)

This bug was fixed in the package linux - 4.18.0-8.9

---------------
linux (4.18.0-8.9) cosmic; urgency=medium

  * linux: 4.18.0-8.9 -proposed tracker (LP: #1791663)

  * Cosmic update to v4.18.7 stable release (LP: #1791660)
    - rcu: Make expedited GPs handle CPU 0 being offline
    - net: 6lowpan: fix reserved space for single frames
    - net: mac802154: tx: expand tailroom if necessary
    - 9p/net: Fix zero-copy path in the 9p virtio transport
    - spi: davinci: fix a NULL pointer dereference
    - spi: pxa2xx: Add support for Intel Ice Lake
    - spi: spi-fsl-dspi: Fix imprecise abort on VF500 during probe
    - spi: cadence: Change usleep_range() to udelay(), for atomic context
    - mmc: block: Fix unsupported parallel dispatch of requests
    - mmc: renesas_sdhi_internal_dmac: mask DMAC interrupts
    - mmc: renesas_sdhi_internal_dmac: fix #define RST_RESERVED_BITS
    - readahead: stricter check for bdi io_pages
    - block: fix infinite loop if the device loses discard capability
    - block: blk_init_allocated_queue() set q->fq as NULL in the fail case
    - block: really disable runtime-pm for blk-mq
    - blkcg: Introduce blkg_root_lookup()
    - block: Introduce blk_exit_queue()
    - block: Ensure that a request queue is dissociated from the cgroup controller
    - apparmor: fix bad debug check in apparmor_secid_to_secctx()
    - dma-buf: Move BUG_ON from _add_shared_fence to _add_shared_inplace
    - libertas: fix suspend and resume for SDIO connected cards
    - media: Revert "[media] tvp5150: fix pad format frame height"
    - mailbox: xgene-slimpro: Fix potential NULL pointer dereference
    - Replace magic for trusting the secondary keyring with #define
    - Fix kexec forbidding kernels signed with keys in the secondary keyring to
      boot
    - powerpc/fadump: handle crash memory ranges array index overflow
    - powerpc/64s: Fix page table fragment refcount race vs speculative references
    - powerpc/pseries: Fix endianness while restoring of r3 in MCE handler.
    - powerpc/pkeys: Give all threads control of their key permissions
    - powerpc/pkeys: Deny read/write/execute by default
    - powerpc/pkeys: key allocation/deallocation must not change pkey registers
    - powerpc/pkeys: Save the pkey registers before fork
    - powerpc/pkeys: Fix calculation of total pkeys.
    - powerpc/pkeys: Preallocate execute-only key
    - powerpc/nohash: fix pte_access_permitted()
    - powerpc64/ftrace: Include ftrace.h needed for enable/disable calls
    - powerpc/powernv/pci: Work around races in PCI bridge enabling
    - cxl: Fix wrong comparison in cxl_adapter_context_get()
    - IB/mlx5: Honor cnt_set_id_valid flag instead of set_id
    - IB/mlx5: Fix leaking stack memory to userspace
    - IB/srpt: Fix srpt_cm_req_recv() error path (1/2)
    - IB/srpt: Fix srpt_cm_req_recv() error path (2/2)
    - IB/srpt: Support HCAs with more than two ports
    - overflow.h: Add arithmetic shift helper
    - RDMA/mlx5: Fix shift overflow in mlx5_ib_create_wq
    - ib_srpt: Fix a use-after-free in srpt_close_ch()
    - ib_srpt: Fix a use-after-free in __srpt_close_all_ch()
    - RDMA/rxe: Set wqe->status correctly if an unexpected...

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
Brad Figg (brad-figg)
tags: added: cscc
bugproxy (bugproxy)
tags: added: targetmilestone-inin18042
removed: targetmilestone-inin---
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.