Undetected Data corruption in MPI workloads that use VSX for reductions on POWER9 DD2.1 systems

Bug #1902694 reported by bugproxy
This bug affects 1 person
Affects                            Status        Importance  Assigned to         Milestone
The Ubuntu-power-systems project   Fix Released  High        Patricia Domingues
linux (Ubuntu)                     Fix Released  Undecided   Unassigned
  Focal                            Fix Released  Undecided   Unassigned
  Groovy                           Fix Released  Undecided   Unassigned
  Hirsute                          Fix Released  Undecided   Unassigned

Bug Description

SRU Justification:

[Impact]

* A data integrity issue was observed on POWER9 (DD2.1) systems.

* It affects Ubuntu 20.04 with kernel 5.4.0-52 and Ubuntu 20.10 with kernel 5.8.0-26.

* The root cause lies in how p9_hmi_special_emu() is compiled.

* The VMX store (in __get_user_atomic_128_aligned()) writes to a buffer (vbuf) that is not 128-bit (16-byte) aligned.
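
The effect of the misalignment can be illustrated with some address arithmetic. The stvx instruction ignores the low four bits of its effective address, so a store aimed at a buffer that is not 16-byte aligned lands on the preceding 16-byte boundary instead. The following is only a sketch: the stack pointer value is hypothetical and assumed to be 16-byte aligned, and the helper name is made up for illustration. The offsets 40 and 32 are the ones the mpe.py output in this report shows for the affected and the fixed kernels.

```python
VMX_ALIGN = 16  # stvx stores one 16-byte (128-bit) quadword


def stvx_effective_address(addr: int) -> int:
    """Address stvx actually writes to: the low 4 bits are ignored."""
    return addr & ~(VMX_ALIGN - 1)


# Hypothetical stack pointer (r1) value, assumed to be 16-byte aligned.
r1 = 0xC000_0000_1234_5000

for offset in (40, 32):
    vbuf = r1 + offset
    target = stvx_effective_address(vbuf)
    print(f"offset {offset}: vbuf={vbuf:#x}, store lands at {target:#x} "
          f"({vbuf - target} bytes below vbuf)")
```

Under these assumptions, offset 40 makes the store land 8 bytes below vbuf, so part of the quadword overwrites whatever sits just before the buffer and vbuf does not end up holding the intended data. With offset 32 the store hits vbuf exactly.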

[Fix]

* 1da4a0272c54 "powerpc: Fix undetected data corruption with P9N DD2.1 VSX CI load emulation"

* d1781f237047 "selftests/powerpc: Make alignment handler test P9N DD2.1 vector CI load workaround"

[Test Case]

* A POWER9 (DD2.1) bare metal system is needed with Ubuntu 20.04, 20.10 or 21.04 installed.

* It's best to test this with a sample application and the test case from
  "selftests/powerpc: Make alignment handler test P9N DD2.1 vector CI load workaround"

[Regression Potential]

* The regression risk is moderate, because:

* the problem only occurs when special VSX (vector) instructions are in use, e.g. in p9_hmi_special_emu

* it occurs on bare metal only, and only on POWER9 (DD2.1)

* and the changes are small and easy to review (in total one effective code line per patch/commit)

* Since only p9_hmi_special_emu is touched, any regression would break that function, which is already broken because of this bug.

[Other]

* According to the reporter this affects Ubuntu 20.04 / 5.4.0-52 and 20.10 / 5.8.0-26.

* Since the development of Hirsute is already open the SRU is requested for Hirsute, too.

* The patches were accepted upstream in v5.10-rc1 and v5.10-rc2.

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-189000 severity-medium targetmilestone-inin2010
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Frank Heimes (fheimes) wrote :

Even though this was submitted as medium, I am bumping it to high.

Changed in ubuntu-power-systems:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
importance: Undecided → High
Patricia Domingues (patriciasd) wrote :

Hi Waiki, thanks for raising this bug.
Could you please share a test scenario for verifying these patches, i.e. how they can be tested? This is needed for every kernel SRU. Thanks in advance.

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-11-03 13:56 EDT-------
Hello Patricia,
We have a script (mpe.py) that detects whether a vmlinux has the alignment issue. We ran it against the 16.04, 18.04, 20.04 and 20.10 kernels and found the misalignment only in 20.04 and 20.10. The 16.04 kernel doesn't have p9_hmi_special_emu().

16.04:
user@ltc-zz9:/tmp/test/boot$ ~/mpe.py vmlinux-4.4.0-193-generic System.map-4.4.0-193-generic
Couldn't find p9_hmi_special_emu in objdump output
Error: couldn't find stvx!
18.04:
user@ltc-zz9:/tmp/test/boot$ ~/mpe.py vmlinux-4.15.0-122-generic System.map-4.15.0-122-generic
Couldn't find p9_hmi_special_emu in objdump output
stvx found using register r25:
c00000000002988c: ce c9 00 7c stvx v0,0,r25
addi found using offset 32:
c000000000029884: 20 00 21 3b addi r25,r1,32
OK - offset is aligned
20.04:
user@ltc-zz9:/tmp/test/boot$ ~/mpe.py vmlinux-5.4.0-52-generic System.map-5.4.0-52-generic
stvx found using register r28:
c00000000002cbec: ce e1 00 7c stvx v0,0,r28
addi found using offset 40:
c00000000002cbe4: 28 00 81 3b addi r28,r1,40
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Offset is misaligned - bug present !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20.10:
user@ltc-zz9:/tmp/test/boot$ ~/mpe.py vmlinux-5.8.0-26-generic System.map-5.8.0-26-generic
stvx found using register r9:
c000000000025a78: ce 49 00 7c stvx v0,0,r9
addi found using offset 40:
c000000000025a70: 28 00 21 39 addi r9,r1,40
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Offset is misaligned - bug present !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

I will attach the mpe.py script.
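
For readers without access to the attachment, the check reported above can be approximated as follows. This is not the attached mpe.py, only a minimal sketch of the same idea: find the stvx, find the addi that derives its address register from r1, and test whether the offset is a multiple of 16. The function name and regular expressions are illustrative assumptions, and the input is assumed to be objdump-style text like the lines quoted in this comment.

```python
import re

# Patterns for lines such as "... stvx v0,0,r28" and "... addi r28,r1,40".
STVX_RE = re.compile(r"\bstvx\s+v\d+,\s*(?:0|r\d+),\s*(r\d+)")
ADDI_RE = re.compile(r"\baddi\s+(r\d+),\s*r1,\s*(-?\d+)")


def check_alignment(disassembly: str) -> bool:
    """Return True if the stack offset feeding the first stvx is 16-byte aligned."""
    lines = disassembly.splitlines()
    for i, line in enumerate(lines):
        stvx = STVX_RE.search(line)
        if not stvx:
            continue
        reg = stvx.group(1)
        # Walk backwards to the addi that computes this register from r1.
        for prev in reversed(lines[:i]):
            addi = ADDI_RE.search(prev)
            if addi and addi.group(1) == reg:
                return int(addi.group(2)) % 16 == 0
        raise ValueError(f"no 'addi {reg},r1,<offset>' found before stvx")
    raise ValueError("no stvx found")


# Lines taken from the 20.04 (5.4.0-52) and 18.04 (4.15.0-122) output above.
affected = """\
c00000000002cbe4: 28 00 81 3b addi r28,r1,40
c00000000002cbec: ce e1 00 7c stvx v0,0,r28
"""
aligned = """\
c000000000029884: 20 00 21 3b addi r25,r1,32
c00000000002988c: ce c9 00 7c stvx v0,0,r25
"""

print("5.4.0-52 aligned:", check_alignment(affected))
print("4.15.0-122 aligned:", check_alignment(aligned))
```

A real checker would first have to isolate the disassembly of p9_hmi_special_emu from the vmlinux image, which this sketch leaves out.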

bugproxy (bugproxy) wrote : mpe.py - test script for vmlinux misalignment

------- Comment (attachment only) From <email address hidden> 2020-11-03 13:57 EDT-------

bugproxy (bugproxy)
tags: added: severity-high
removed: severity-medium
description: updated
summary: - Ubuntu 20.10- Undetected Data corruption in MPI workloads that use VSX
- for reductions on POWER9 DD2.1 systems
+ Undetected Data corruption in MPI workloads that use VSX for reductions
+ on POWER9 DD2.1 systems
Frank Heimes (fheimes) wrote :
Changed in linux (Ubuntu Hirsute):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → nobody
Changed in linux (Ubuntu Focal):
status: New → In Progress
Changed in linux (Ubuntu Groovy):
status: New → In Progress
Changed in linux (Ubuntu Hirsute):
status: New → In Progress
Changed in ubuntu-power-systems:
status: New → In Progress
Changed in ubuntu-power-systems:
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Patricia Domingues (patriciasd)
Ian May (ian-may)
Changed in linux (Ubuntu Groovy):
status: In Progress → Fix Committed
Ian May (ian-may)
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
tags: added: verification-needed-groovy
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-groovy' to 'verification-done-groovy'. If the problem still exists, change the tag 'verification-needed-groovy' to 'verification-failed-groovy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-11-17 11:58 EDT-------
Verified for focal and groovy:

20.04:
waiki@ltc-wspoon3:~/DI/code/boot$ ~/mpe.py vmlinux-5.4.0-55-generic System.map-5.4.0-55-generic
stvx found using register r28:
c00000000002cbfc: ce e1 00 7c stvx v0,0,r28

addi found using offset 32:
c00000000002cbf4: 20 00 81 3b addi r28,r1,32

OK - offset is aligned

20.10:
waiki@ltc-wspoon3:~/DI/code/boot$ ~/mpe.py vmlinux-5.8.0-30-generic System.map-5.8.0-30-generic
stvx found using register r9:
c000000000025a78: ce 49 00 7c stvx v0,0,r9

addi found using offset 32:
c000000000025a70: 20 00 21 39 addi r9,r1,32

OK - offset is aligned
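
The raw instruction bytes in this output can be cross-checked against the mnemonics. The sketch below decodes the D-form addi encoding (primary opcode 14, then RT, RA and a 16-bit signed immediate) from the little-endian byte dump. The helper name is hypothetical, and the byte order is assumed to be little-endian as in these ppc64le listings.

```python
from typing import Tuple


def decode_addi(hex_bytes: str) -> Tuple[int, int, int]:
    """Decode a little-endian Power ISA addi into (RT, RA, SI)."""
    word = int.from_bytes(bytes.fromhex(hex_bytes), "little")
    opcode = word >> 26
    if opcode != 14:  # primary opcode 14 is addi
        raise ValueError(f"not an addi instruction (opcode {opcode})")
    rt = (word >> 21) & 0x1F   # target register
    ra = (word >> 16) & 0x1F   # base register
    si = word & 0xFFFF         # 16-bit immediate
    if si & 0x8000:            # sign-extend the immediate
        si -= 0x10000
    return rt, ra, si


# Byte dumps quoted in this report: fixed 5.4.0-55, fixed 5.8.0-30,
# and the affected 5.4.0-52 kernel from the earlier comment.
for dump in ("20 00 81 3b", "20 00 21 39", "28 00 81 3b"):
    rt, ra, si = decode_addi(dump)
    state = "aligned" if si % 16 == 0 else "misaligned"
    print(f"{dump}: addi r{rt},r{ra},{si} ({state})")
```

The decoded fields agree with the disassembly shown: the fixed kernels add 32 to r1, whereas the affected 5.4.0-52 kernel added 40.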

tags: added: verification-done-focal verification-done-groovy
removed: verification-needed-focal verification-needed-groovy
Patricia Domingues (patriciasd) wrote :

Thanks Waiki!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-56.62

---------------
linux (5.4.0-56.62) focal; urgency=medium

  * focal/linux: 5.4.0-56.62 -proposed tracker (LP: #1905300)

  * CVE-2020-4788
    - selftests/powerpc: rfi_flush: disable entry flush if present
    - powerpc/64s: flush L1D on kernel entry
    - powerpc/64s: flush L1D after user accesses
    - selftests/powerpc: entry flush test

linux (5.4.0-55.61) focal; urgency=medium

  * focal/linux: 5.4.0-55.61 -proposed tracker (LP: #1903175)

  * Update kernel packaging to support forward porting kernels (LP: #1902957)
    - [Debian] Update for leader included in BACKPORT_SUFFIX

  * Avoid double newline when running insertchanges (LP: #1903293)
    - [Packaging] insertchanges: avoid double newline

  * EFI: Fails when BootCurrent entry does not exist (LP: #1899993)
    - efivarfs: Replace invalid slashes with exclamation marks in dentries.

  * CVE-2020-14351
    - perf/core: Fix race in the perf_mmap_close() function

  * raid10: Block discard is very slow, causing severe delays for mkfs and
    fstrim operations (LP: #1896578)
    - md: add md_submit_discard_bio() for submitting discard bio
    - md/raid10: extend r10bio devs to raid disks
    - md/raid10: pull codes that wait for blocked dev into one function
    - md/raid10: improve raid10 discard request
    - md/raid10: improve discard request for far layout
    - dm raid: fix discard limits for raid1 and raid10
    - dm raid: remove unnecessary discard limits for raid10

  * Bionic: btrfs: kernel BUG at /build/linux-
    eTBZpZ/linux-4.15.0/fs/btrfs/ctree.c:3233! (LP: #1902254)
    - btrfs: drop unnecessary offset_in_page in extent buffer helpers
    - btrfs: extent_io: do extra check for extent buffer read write functions
    - btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()
    - btrfs: extent-tree: kill the BUG_ON() in insert_inline_extent_backref()
    - btrfs: ctree: check key order before merging tree blocks

  * Ethernet no link lights after reboot (Intel i225-v 2.5G) (LP: #1902578)
    - igc: Add PHY power management control

  * Undetected Data corruption in MPI workloads that use VSX for reductions on
    POWER9 DD2.1 systems (LP: #1902694)
    - powerpc: Fix undetected data corruption with P9N DD2.1 VSX CI load emulation
    - selftests/powerpc: Make alignment handler test P9N DD2.1 vector CI load
      workaround

  * [20.04 FEAT] Support/enhancement of NVMe IPL (LP: #1902179)
    - s390: nvme ipl
    - s390: nvme reipl
    - s390/ipl: support NVMe IPL kernel parameters

  * uvcvideo: add mapping for HEVC payloads (LP: #1895803)
    - media: uvcvideo: Add mapping for HEVC payloads

  * Focal update: v5.4.73 upstream stable release (LP: #1902115)
    - ibmveth: Switch order of ibmveth_helper calls.
    - ibmveth: Identify ingress large send packets.
    - ipv4: Restore flowi4_oif update before call to xfrm_lookup_route
    - mlx4: handle non-napi callers to napi_poll
    - net: fec: Fix phy_device lookup for phy_reset_after_clk_enable()
    - net: fec: Fix PHY init after phy_reset_after_clk_enable()
    - net: fix pos incrementment in ipv6_route_seq_next
    - net/smc: fix valid DMBE buffer sizes
    - net...

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.8.0-31.33

---------------
linux (5.8.0-31.33) groovy; urgency=medium

  * groovy/linux: 5.8.0-31.33 -proposed tracker (LP: #1905299)

  * Groovy 5.8 kernel hangs on boot on CPUs with eLLC (LP: #1903397)
    - drm/i915: Mark ininitial fb obj as WT on eLLC machines to avoid rcu lockup
      during fbdev init

  * CVE-2020-4788
    - selftests/powerpc: rfi_flush: disable entry flush if present
    - powerpc/64s: flush L1D on kernel entry
    - powerpc/64s: flush L1D after user accesses
    - selftests/powerpc: entry flush test

linux (5.8.0-30.32) groovy; urgency=medium

  * groovy/linux: 5.8.0-30.32 -proposed tracker (LP: #1903194)

  * Update kernel packaging to support forward porting kernels (LP: #1902957)
    - [Debian] Update for leader included in BACKPORT_SUFFIX

  * Avoid double newline when running insertchanges (LP: #1903293)
    - [Packaging] insertchanges: avoid double newline

  * EFI: Fails when BootCurrent entry does not exist (LP: #1899993)
    - efivarfs: Replace invalid slashes with exclamation marks in dentries.

  * raid10: Block discard is very slow, causing severe delays for mkfs and
    fstrim operations (LP: #1896578)
    - md: add md_submit_discard_bio() for submitting discard bio
    - md/raid10: extend r10bio devs to raid disks
    - md/raid10: pull codes that wait for blocked dev into one function
    - md/raid10: improve raid10 discard request
    - md/raid10: improve discard request for far layout
    - dm raid: fix discard limits for raid1 and raid10
    - dm raid: remove unnecessary discard limits for raid10

  * Bionic: btrfs: kernel BUG at /build/linux-
    eTBZpZ/linux-4.15.0/fs/btrfs/ctree.c:3233! (LP: #1902254)
    - btrfs: extent_io: do extra check for extent buffer read write functions
    - btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()
    - btrfs: extent-tree: kill the BUG_ON() in insert_inline_extent_backref()
    - btrfs: ctree: check key order before merging tree blocks

  * Tiger Lake PMC core driver fixes (LP: #1899883)
    - platform/x86: intel_pmc_core: update TGL's LPM0 reg bit map name
    - platform/x86: intel_pmc_core: fix bound check in pmc_core_mphy_pg_show()
    - platform/x86: pmc_core: Use descriptive names for LPM registers
    - platform/x86: intel_pmc_core: Fix TigerLake power gating status map
    - platform/x86: intel_pmc_core: Fix the slp_s0 counter displayed value

  * drm/i915/dp_mst - System would hang during the boot up. (LP: #1902469)
    - Revert "UBUNTU: SAUCE: drm/i915/display: Fix null deref in
      intel_psr_atomic_check()"
    - drm/i915: Fix encoder lookup during PSR atomic check

  * Undetected Data corruption in MPI workloads that use VSX for reductions on
    POWER9 DD2.1 systems (LP: #1902694)
    - powerpc: Fix undetected data corruption with P9N DD2.1 VSX CI load emulation
    - selftests/powerpc: Make alignment handler test P9N DD2.1 vector CI load
      workaround

  * [20.04 FEAT] Support/enhancement of NVMe IPL (LP: #1902179)
    - s390/ipl: support NVMe IPL kernel parameters

  * uvcvideo: add mapping for HEVC payloads (LP: #1895803)
    - media: uvcvideo: Add mapping for HEVC payloads

  * risc-v 5.8 ...

Changed in linux (Ubuntu Groovy):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.8.0-36.40+21.04.1

---------------
linux (5.8.0-36.40+21.04.1) hirsute; urgency=medium

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  [ Ubuntu: 5.8.0-36.40 ]

  * debian/scripts/file-downloader does not handle positive failures correctly
    (LP: #1878897)
    - [Packaging] file-downloader not handling positive failures correctly

  [ Ubuntu: 5.8.0-35.39 ]

  * Packaging resync (LP: #1786013)
    - update dkms package versions
  * CVE-2021-1052 // CVE-2021-1053
    - [Packaging] NVIDIA -- Add the NVIDIA 460 driver

 -- Kleber Sacilotto de Souza <email address hidden> Thu, 07 Jan 2021 11:57:30 +0100

Changed in linux (Ubuntu Hirsute):
status: In Progress → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: In Progress → Fix Released