powerpc/64s: Add workaround for P9 vector CI load issue

Bug #1721070 reported by bugproxy on 2017-10-03
Affects | Importance | Assigned to
The Ubuntu-power-systems project | Critical | Canonical Kernel Team
linux (Ubuntu) | Critical | Joseph Salisbury
Zesty | Critical | Joseph Salisbury
Artful | Critical | Joseph Salisbury

Bug Description

== SRU Justification ==
POWER9 DD2.1 and earlier have an issue where some cache-inhibited (CI)
vector loads return bad data. The fix has two parts: a
firmware/microcode part that triggers HMI interrupts when such loads
are hit, and commit 5080332c2c89 from linux-next, which then
emulates the instructions in Linux.

The affected instructions are limited to lxvd2x, lxvw4x, lxvb16x and
lxvh8x.
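
For readers who want to see what "limited to" these four instructions means at the encoding level, here is a minimal user-space sketch of recognising them. It is not code from the patch. It assumes the Power ISA encodings as I understand them (primary opcode 31, extended opcodes 780, 812, 844 and 876, XX1-form operand fields), and the names `classify_vec_load` and `decode_vec_load` are made up for illustration. Please check the constants against the ISA before relying on them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The four loads named in this bug, plus "anything else". */
enum vec_load { VEC_NONE, VEC_LXVW4X, VEC_LXVH8X, VEC_LXVD2X, VEC_LXVB16X };

/* Operand fields of an XX1-form indexed load (hypothetical helper type). */
struct vec_load_fields {
    unsigned vsr; /* target VSX register 0..63, built from TX and T */
    unsigned ra;  /* base GPR number; 0 means "use the value zero" */
    unsigned rb;  /* index GPR number */
};

/* Classify a 32-bit instruction word (assumed encodings, see lead-in). */
static enum vec_load classify_vec_load(uint32_t instr)
{
    if ((instr >> 26) != 31)          /* primary opcode must be 31 */
        return VEC_NONE;
    switch ((instr >> 1) & 0x3ffu) {  /* 10-bit extended opcode */
    case 780: return VEC_LXVW4X;
    case 812: return VEC_LXVH8X;
    case 844: return VEC_LXVD2X;
    case 876: return VEC_LXVB16X;
    default:  return VEC_NONE;
    }
}

/* Extract the register numbers an emulator would need to compute the
 * effective address (RA|0) + RB and to pick the destination register. */
static void decode_vec_load(uint32_t instr, struct vec_load_fields *out)
{
    assert(out != NULL);
    out->vsr = ((instr & 1u) << 5) | ((instr >> 21) & 0x1fu);
    out->ra  = (instr >> 16) & 0x1fu;
    out->rb  = (instr >> 11) & 0x1fu;
}
```

For example, the word 0x7c432699 would classify as lxvd2x with target VSR 34, RA = 3 and RB = 4 under these assumptions.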

Commit ccd3cd361341 is needed as a prereq for Artful.
Commits a3d96f70c147 and ccd3cd361341 are needed as prereqs for Zesty.

== Fixes ==
a3d96f70c147 ("powerpc/64s: Fix system reset vs general interrupt reentrancy")
ccd3cd361341 ("powerpc/mce: Move 64-bit machine check code into mce.c")
5080332c2c89 ("powerpc/64s: Add workaround for P9 vector CI load issue")

== Regression Potential ==
These commits are specific to powerpc. They required some backporting but
have been tested by IBM.

-- Problem Description --

When an instruction triggers the HMI, all threads in the core will be
sent to the HMI handler, not just the one running the vector load.

In general, these spurious HMIs are detected by the emulation code and
we just return back to the running process. Unfortunately, if a
spurious interrupt occurs on a vector load that's to normal memory we
have no way to detect that it's spurious (unless we walk the page
tables, which is very expensive). In this case we emulate the load, but
we need to do so using a vector load itself to ensure 128-bit atomicity is
preserved.
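
The per-thread decision described above can be sketched roughly as follows. This is a simplified user-space model, not the kernel handler. The `thread_state` struct and the function names are hypothetical, and the mask/match constants are my own derivation from the four instruction encodings (they ignore the TX bit and the two extended-opcode bits that differ between the four loads). The real emulation must also perform the load as a single 16-byte vector access, which plain C cannot express portably, so that part appears only as a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One mask/match pair covering lxvw4x, lxvh8x, lxvd2x and lxvb16x
 * (assumed encodings; verify against the Power ISA). */
#define VEC_CI_LOAD_MASK  0xfc00073eu
#define VEC_CI_LOAD_MATCH 0x7c000618u

enum hmi_action {
    HMI_RETURN,  /* spurious for this thread: resume the interrupted code */
    HMI_EMULATE  /* candidate load: emulate it with one 16-byte vector load */
};

/* Hypothetical per-thread snapshot: the instruction word found at NIP. */
struct thread_state {
    uint32_t instr_at_nip;
};

static enum hmi_action hmi_decide(const struct thread_state *t)
{
    assert(t != NULL);
    /* Not one of the four loads: this thread was only dragged in. */
    if ((t->instr_at_nip & VEC_CI_LOAD_MASK) != VEC_CI_LOAD_MATCH)
        return HMI_RETURN;
    /* Telling CI from normal memory would need a page table walk,
     * so every matching load is emulated, even if it was spurious. */
    return HMI_EMULATE;
}

/* All threads in the core receive the HMI; count how many must emulate. */
static size_t hmi_core_emulations(const struct thread_state *threads, size_t n)
{
    size_t emulated = 0;
    for (size_t i = 0; i < n; i++)
        if (hmi_decide(&threads[i]) == HMI_EMULATE)
            emulated++;
    return emulated;
}
```

In this model a thread that happens to be executing one of the four loads against normal memory is still emulated, which is why the emulation itself has to preserve the atomicity of the original instruction.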

Some additional debugfs emulated instruction counters are added also.

In order to solve this bug, we need to cherry-pick the following patch:

https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=5080332c2c893118dbc18755f35c8b0131cf0fc4

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-159632 severity-critical targetmilestone-inin1710

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
status: New → Triaged
importance: Undecided → Medium
Changed in linux (Ubuntu Zesty):
status: New → Triaged
importance: Undecided → Medium
tags: added: kernel-da-key
Changed in ubuntu-power-systems:
importance: Undecided → Critical
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu Zesty):
importance: Medium → Critical
Changed in linux (Ubuntu Artful):
importance: Medium → Critical
Frank Heimes (fheimes) on 2017-10-03
Changed in ubuntu-power-systems:
status: New → Triaged
no longer affects: linux (Ubuntu Zesty)

I built a 17.10 (Artful) test kernel with a pick of commit 5080332c2c89. It required commit ccd3cd361 as a prerequisite. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1721070/artful/

Can you test this kernel and see if it resolves this bug?

Also, is this patch needed in 17.04/16.04.3 (4.10 kernels)? If so, we will need to identify all the needed prereq commits.

Changed in linux (Ubuntu Artful):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
status: Triaged → In Progress
Frank Heimes (fheimes) on 2017-10-04
Changed in ubuntu-power-systems:
status: Triaged → In Progress
Breno Leitão (breno-leitao) wrote :

Joseph,

Do you have the kernel source for this package, so that I can take a deeper look?

Joseph Salisbury (jsalisbury) wrote :

The source code for the kernel posted in comment #3 is the ubuntu-artful repo with commits ccd3cd361 and 5080332c2c89 on top.

The artful repo is available at:
git://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/artful

Both commits were clean picks and did not need back porting.

I also tar'd up the source I used and put it here under the file name lp1721070-source.tar:
http://kernel.ubuntu.com/~jsalisbury/lp1721070/

Breno Leitão (breno-leitao) wrote :

Joseph,

Which branch did you use? I can't find the commits at master or master-next:

[root@ltc-wspoon9 ubuntu-artful]# git log --oneline js-master | head -n 20
1e3dad9 UBUNTU: Ubuntu-4.13.0-12.13
1f46464 UBUNTU: [Config] CONFIG_I2C_XLP9XX=m
39ae1ca perf/x86: Fix data source decoding for Skylake
f473bf9 perf/x86: Move Nehalem PEBS code to flag
a8725d5 UBUNTU: [Config] CONFIG_DRM_VBOXVIDEO=n
30d9497 UBUNTU: [Config] Disable CONFIG_IPMMU_VMSA on arm64
806240e scsi: cxlflash: Fix vlun resize failure in the shrink path
69f4910 fs: aio: fix the increment of aio-nr and counting against aio-max-nr
675f3cd UBUNTU: [Config] CONFIG_PINCTRL_DENVERTON=m
ebaccf7 libnvdimm, btt: rework error clearing
4bda159 libnvdimm: fix potential deadlock while clearing errors
adfc036 libnvdimm, btt: cache sector_size in arena_info
b79a931 libnvdimm, btt: ensure that flags were also unchanged during a map_read
45f43c2 libnvdimm, btt: refactor map entry operations with macros
36a7c4e libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
3f6cafd dax: move all DAX radix tree defs to fs/dax.c
52522e9 dax: remove DAX code from page_cache_tree_insert()
54585e3 dax: use common 4k zero page for dax mmap reads
7b6956d dax: relocate some dax functions
a9342b2 mm: add vm_insert_mixed_mkwrite()

[root@ltc-wspoon9 ubuntu-artful]# git log --oneline js-master-next | head -n 20
f30326f4 UBUNTU: Start new release
ba39ad1 UBUNTU: Ubuntu-4.13.0-15.16
655913c Revert "powerpc/powernv: Add IMC OPAL APIs"
4bab524 Revert "powerpc/powernv: Detect and create IMC device"
192a8d2 Revert "powerpc/perf: Add nest IMC PMU support"
1b9f175 Revert "powerpc/powernv: Add support for powercap framework"
386a56a Revert "powerpc/powernv: Add support to set power-shifting-ratio"
8dc96fd Revert "powerpc/powernv: Enable PCI peer-to-peer"
fdea4c4 Revert "powerpc/powernv/vas: Define macros, register fields and structures"
258fa77 Revert "powerpc/powernv: Move GET_FIELD/SET_FIELD to vas.h"
efb0381 Revert "powerpc/powernv/vas: Define vas_init() and vas_exit()"
18c0c36 Revert "powerpc/powernv/vas: Define helpers to access MMIO regions"
b717098 Revert "powerpc/powernv/vas: Define helpers to init window context"
cef2434 Revert "powerpc/powernv/vas: Define helpers to alloc/free windows"
8d6ed22 Revert "powerpc/powernv/vas: Define vas_rx_win_open() interface"
15b835b Revert "powerpc/powernv/vas: Define vas_win_close() interface"
47073e3 Revert "powerpc/powernv/vas: Define vas_tx_win_open()"
5afb9ea Revert "powerpc/powernv/vas: Define copy/paste interfaces"
9cf4ce8 Revert "UBUNTU: [Config] CONFIG_PPC_VAS=y"
b22a1be Revert "crypto/nx: Rename nx842_powernv_function as icswx function"

Joseph Salisbury (jsalisbury) wrote :

The commits are in the master branch:

0f5d387 powerpc/64s: Add workaround for P9 vector CI load issue
d401742 powerpc/mce: Move 64-bit machine check code into mce.c
1e3dad9 UBUNTU: Ubuntu-4.13.0-12.13
1f46464 UBUNTU: [Config] CONFIG_I2C_XLP9XX=m

I'll check to ensure I tar'd up the right tree.

Changed in linux (Ubuntu Zesty):
status: New → Triaged
importance: Undecided → Critical
Joseph Salisbury (jsalisbury) wrote :

I do see them in that tar file. Can you run:

tar -xvf lp1721070-source.tar
cd ubuntu-artful/
git log --oneline

Joseph Salisbury (jsalisbury) wrote :

I also built a 4.10-based test kernel (16.04.3). It required commits ccd3cd361 and a3d96f70c1 as prerequisites. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1721070/zesty

Can you test this kernel and see if it resolves this bug?

Breno Leitão (breno-leitao) wrote :

Joseph,

Quick update on this bug.

I tested your artful test and:

a) It didn't cause any regression. I tested on two P9 machines with different processor levels and on a P8, and the code runs fine.

b) It seems to fix the issue it is intended to fix.

c) Some test cases are still failing and might require extra commits; we might request those in another bug.

Hi @jsalisbury

> I also built a 4.10 based test kernel(16.04.3). [snip]

Thanks.

> Can you test this kernel and see if it resolves this bug?

Yes, it does.

I have verified both the patched one (4.10.0-35.39~lp1721070), which passes,
and its original version (4.10.0-35.39), which fails; so the patches are OK.

As @breno-leitao reported, the main problem is resolved (and that's really good enough for now). A few test cases still report failures, which can likely be resolved with some other, apparently unrelated commits that we are still digging into and would like to submit later on.

cheers,
Mauricio

Changed in linux (Ubuntu Zesty):
status: Triaged → In Progress
assignee: nobody → Joseph Salisbury (jsalisbury)
description: updated

Joseph,

Again, thanks.

I had a few suggestions on the submitted backports, and observed a subtle problem in one of them.
If you prefer, I can submit the ones I have here, if that saves you some cycles.

cheers,
Mauricio

Joseph Salisbury (jsalisbury) wrote :

Thanks for the update, Mauricio.

It would be great if you could re-submit the backports you have.

Manoj Iyer (manjo) on 2017-10-16
tags: added: triage-g

------- Comment From <email address hidden> 2017-10-18 16:20 EDT-------
Joseph, Mauricio,

We would like to have this fix in the next SRU. This is causing a lot of impact on 16.04 HWE kernel users on POWER9.

Joseph, what is the last date we should submit this fix for the next SRU?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-10-19 08:00 EDT-------
Breno,

(In reply to comment #22)
> We would like to have this fix in the next SRU. This is causing a lot of
> impact on 16.04 HWE kernel users on POWER9.

Sure. I am waiting for test feedback to proceed with the submission.

summary: - powerpc/64s: Add workaround for P9 vector CI load issuenext
+ powerpc/64s: Add workaround for P9 vector CI load issue
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-10-26 11:56 EDT-------
*** Bug 159680 has been marked as a duplicate of this bug. ***

bugproxy (bugproxy) on 2017-10-26
tags: added: targetmilestone-inin16043
removed: targetmilestone-inin1710
Changed in linux (Ubuntu Zesty):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Manoj Iyer (manjo) on 2017-11-06
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-zesty' to 'verification-done-zesty'. If the problem still exists, change the tag 'verification-needed-zesty' to 'verification-failed-zesty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-zesty
tags: added: verification-needed-artful

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

Hello IBM,

Could you please verify the fix with the Zesty and Artful kernels currently in proposed?

Thank you,
Kleber

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-11-14 11:58 EDT-------
Status Today:
Tested on Artful
- Aligntest : OK
- Stress-ng : OK, 30 minutes of stress of CPUs was fine.

Missing Zesty environment.

Po-Hsu Lin (cypressyew) on 2017-11-15
tags: added: verification-done-artful
removed: verification-needed-artful
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-11-15 16:50 EDT-------
The test on Zesty worked fine with the proposed kernel 4.10.0-40-generic.

Both releases worked with aligntest for CI Vector Load and stress-ng focused on CPU.

tags: added: verification-done-zesty
removed: verification-needed-zesty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.10.0-40.44

---------------
linux (4.10.0-40.44) zesty; urgency=low

  * linux: 4.10.0-40.44 -proposed tracker (LP: #1731269)

  * s390/mm: fix write access check in gup_huge_pmd() (LP: #1730596)
    - s390/mm: fix write access check in gup_huge_pmd()

 -- Kleber Sacilotto de Souza <email address hidden> Thu, 09 Nov 2017 15:24:07 +0100

Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.13.0-17.20

---------------
linux (4.13.0-17.20) artful; urgency=low

  * linux: 4.13.0-17.20 -proposed tracker (LP: #1728927)

  [ Seth Forshee ]
  * thunderx2 ahci errata workaround needs additional delays (LP: #1724117)
    - SAUCE: ahci: thunderx2: stop engine fix update

  * usb 3-1: 2:1: cannot get freq at ep 0x1 (LP: #1708499)
    - ALSA: usb-audio: Add sample rate quirk for Plantronics C310/C520-M

  * Plantronics Blackwire C520-M - Cannot get freq at ep 0x1, 0x81
    (LP: #1709282)
    - ALSA: usb-audio: Add sample rate quirk for Plantronics C310/C520-M

  * TSC_DEADLINE incorrectly disabled inside virtual guests (LP: #1724912)
    - x86/apic: Silence "FW_BUG TSC_DEADLINE disabled due to Errata" on CPUs
      without the feature
    - x86/apic: Silence "FW_BUG TSC_DEADLINE disabled due to Errata" on
      hypervisors

  * x86/apic: Update TSC_DEADLINE quirk with additional SKX stepping
    (LP: #1724612)
    - x86/apic: Update TSC_DEADLINE quirk with additional SKX stepping

  * [Artful] Add support for Dell/Wyse 3040 audio codec (LP: #1723916)
    - SAUCE: ASoC: rt5670: Add support for Wyse 3040

  * [Artful] Some Dell Monitors Doesn't Work Well with Dell/Wyse 3040
    (LP: #1723915)
    - SAUCE: drm/i915: Workaround for DP DPMS D3 on Dell monitor

  * [Artful] Support headset mode for DELL WYSE (LP: #1723913)
    - SAUCE: ALSA: hda/realtek - Add support headset mode for DELL WYSE

  * Touchpad and TrackPoint Dose Not Work on Lenovo X1C6 and X280 (LP: #1723986)
    - SAUCE: Input: synaptics-rmi4 - RMI4 can also use SMBUS version 3
    - SAUCE: Input: synaptics - Lenovo X1 Carbon 5 should use SMBUS/RMI
    - SAUCE: Input: synaptics - add Intertouch support on X1 Carbon 6th and X280

  * Artful update to v4.13.8 stable release (LP: #1724669)
    - USB: dummy-hcd: Fix deadlock caused by disconnect detection
    - MIPS: math-emu: Remove pr_err() calls from fpu_emu()
    - MIPS: bpf: Fix uninitialised target compiler error
    - mei: always use domain runtime pm callbacks.
    - dmaengine: edma: Align the memcpy acnt array size with the transfer
    - dmaengine: ti-dma-crossbar: Fix possible race condition with dma_inuse
    - NFS: Fix uninitialized rpc_wait_queue
    - nfs/filelayout: fix oops when freeing filelayout segment
    - HID: usbhid: fix out-of-bounds bug
    - crypto: skcipher - Fix crash on zero-length input
    - crypto: shash - Fix zero-length shash ahash digest crash
    - KVM: MMU: always terminate page walks at level 1
    - KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exit
    - usb: renesas_usbhs: Fix DMAC sequence for receiving zero-length packet
    - pinctrl/amd: Fix build dependency on pinmux code
    - iommu/amd: Finish TLB flush in amd_iommu_unmap()
    - device property: Track owner device of device property
    - Revert "vmalloc: back off when the current task is killed"
    - fs/mpage.c: fix mpage_writepage() for pages with buffers
    - ALSA: usb-audio: Kill stray URB at exiting
    - ALSA: seq: Fix use-after-free at creating a port
    - ALSA: seq: Fix copy_from_user() call inside lock
    - ALSA: caiaq: Fix stray URB at probe error path
    - ALSA: li...

Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
Manoj Iyer (manjo) on 2017-11-28
Changed in linux (Ubuntu):
status: In Progress → Fix Released